# CAR COMPANY
# Problem Solving Phase / Notes: Read these carefully
This is a **Python Jupyter Notebook** containing both code and rich text elements, such as figures, links, and equations. The notebook is split into the following sections:

1. **Initial set of pre-filled cells** that you should evaluate (run) to load the Python modules (packages), the datasets required for your task, and their variables into memory.
2. **Description of a concrete task** associated with the dataset. 
3. **Final section (with one or more empty cells)** where you can perform analyses with the loaded dataset (e.g., write a few lines of code if needed), answer the question posed, and describe your reasoning in words.

Read and execute each cell in order, without skipping forward. To execute a cell, place your cursor in the cell and either click the play button in the top left corner of the notebook or press Shift+Enter on your keyboard. It might take a couple of seconds for the output to appear.

Have fun!

In [None]:
#Run the following to import necessary packages and import dataset. Do not use any additional plotting libraries.
import pandas as pd
import numpy as np
from pandas.plotting import parallel_coordinates
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

d1 = "AQ_phase1_dataset1.csv"
d2 = "AQ_phase1_dataset2.csv"
d3 = "AQ_phase1_dataset3.csv"
d4 = "AQ_phase1_dataset4.csv"

df1 = pd.read_csv(d1)
df2 = pd.read_csv(d2)
df3 = pd.read_csv(d3)
df4 = pd.read_csv(d4)

df1_copy=df1.copy()
df2_copy=df2.copy()
df3_copy=df3.copy()
df4_copy=df4.copy()
df1_copy['Input_Dataset_name'] = 'Dataset 1 (Company A)'
df2_copy['Input_Dataset_name'] = 'Dataset 2 (Company B)'
df3_copy['Input_Dataset_name'] = 'Dataset 3 (Company C)'
df4_copy['Input_Dataset_name'] = 'Dataset 4 (Company D)'

#Print first five lines of dataset 1 as a check to see if the datasets are loaded properly.
df1.head(n=5)

# DATASET DESCRIPTION
Each of the 4 dataframes loaded above records the **total number of units sold** (in hundreds) and **employee satisfaction** (on a scale of 1 to 100) from 182 sites all over the world, with one dataframe each for car companies 1, 2, 3 and 4.

Run the cells below to obtain some descriptive (numerical) statistics and a parallel coordinates visualization for these datasets. 

1. **Median** is a measure of central tendency that separates the higher half from the lower half of a data sample.

2. **Interquartile range (IQR)** is a measure of variability (statistical dispersion), based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. IQR is equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles.

3. **Spearman's correlation** measures the strength and direction of monotonic association between two variables. A monotonic relationship is a relationship that does one of the following: (1) as the value of one variable increases, so does the value of the other variable; or (2) as the value of one variable increases, the other variable value decreases. 

4. **Parallel coordinates** is a plotting technique for multivariate data (allows one to estimate some descriptive statistics visually). Here, data points are represented as connected line segments. Each vertical line represents one data attribute. One complete set of connected line segments across all the attributes represents one data point.
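
As a minimal sketch of the first three statistics defined above, the snippet below computes them on a tiny made-up table. The column names `units_sold` and `satisfaction` and all values are illustrative assumptions and are not taken from the provided datasets.

```python
import pandas as pd

# Toy data (hypothetical values and column names, for illustration only).
toy = pd.DataFrame({
    "units_sold":   [10, 20, 30, 40, 50],
    "satisfaction": [55, 60, 58, 70, 90],
})

# Median: the middle value of each column once it is sorted.
medians = toy.median()

# IQR: 75th percentile minus 25th percentile, per column.
iqr = toy.quantile(q=0.75) - toy.quantile(q=0.25)

# Spearman correlation: correlation computed on the ranks of the values.
rho = toy.corr(method="spearman").loc["units_sold", "satisfaction"]

print(medians)         # units_sold 30.0, satisfaction 60.0
print(iqr)             # units_sold 20.0, satisfaction 12.0
print(round(rho, 2))   # 0.9
```

Here satisfaction mostly rises with units sold, with one pair of sites out of order, so the Spearman correlation is high but below 1.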

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#You will receive 4 outputs: median for both variables, inter-quartile range for both variables, Spearman's correlation between the variables, and a parallel coordinates visualization

#CAR COMPANY 1
print ("Median")
round(df1.median(),2)
print (" ")
print ("----")

print ("Interquartile range")
round((df1.quantile(q=0.75) - df1.quantile(q=0.25)),2)
print (" ")
print ("----")

print ("Spearman correlation")
round(df1.corr(method='spearman'),2)
print (" ")
print ("----")

print ("Parallel coordinates visualization")
parallel_coordinates(df1_copy, 'Input_Dataset_name')
plt.show()

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#You will receive 4 outputs: median for both variables, inter-quartile range for both variables, Spearman's correlation between the variables, and a parallel coordinates visualization

#CAR COMPANY 2
print ("Median")
round(df2.median(),2)
print (" ")
print ("----")

print ("Interquartile range")
round((df2.quantile(q=0.75) - df2.quantile(q=0.25)),2)
print (" ")
print ("----")

print ("Spearman correlation")
round(df2.corr(method='spearman'),2)
print (" ")
print ("----")

print ("Parallel coordinates visualization")
parallel_coordinates(df2_copy, 'Input_Dataset_name')
plt.show()

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#You will receive 4 outputs: median for both variables, inter-quartile range for both variables, Spearman's correlation between the variables, and a parallel coordinates visualization

#CAR COMPANY 3
print ("Median")
round(df3.median(),2)
print (" ")
print ("----")

print ("Interquartile range")
round((df3.quantile(q=0.75) - df3.quantile(q=0.25)),2)
print (" ")
print ("----")

print ("Spearman correlation")
round(df3.corr(method='spearman'),2)
print (" ")
print ("----")

print ("Parallel coordinates visualization")
parallel_coordinates(df3_copy, 'Input_Dataset_name')
plt.show()

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#You will receive 4 outputs: median for both variables, inter-quartile range for both variables, Spearman's correlation between the variables, and a parallel coordinates visualization

#CAR COMPANY 4
print ("Median")
round(df4.median(),2)
print (" ")
print ("----")

print ("Interquartile range")
round((df4.quantile(q=0.75) - df4.quantile(q=0.25)),2)
print (" ")
print ("----")

print ("Spearman correlation")
round(df4.corr(method='spearman'),2)
print (" ")
print ("----")

print ("Parallel coordinates visualization")
parallel_coordinates(df4_copy, 'Input_Dataset_name')
plt.show()

# TASK
Design **as many measures as you can** to **rank order** the datasets from the **most successful** to the **least successful** car company. Each measure should take every data point in the datasets into account. We expect you to **generate multiple measures**.

For **each measure that you design**: 

1. Select and print the resulting **dataset ordering** (e.g., 1234, 2134 etc)
2. Write and print a **reasoning** behind your answer selection (an explanation of **why** you took certain steps or performed certain calculations to get to the solution)
3. Select and print how you **used information about the descriptive statistics** (obtained by running the cells above) in reasoning about your answer
4. Select and print your **confidence** in the designed measure

**MAKE SURE** to print all four items listed above for each measure.

# Important note about designing your measures
Below is a **template for a cell** where you can design a measure. To create a new measure:

1. Add a new code cell below the template cell (Click on **"Insert a cell below"** option at the **top right corner** of the cell)
2. Copy all contents of the template cell to this newly added code cell. 
3. Use this newly added code cell to change your answers corresponding to the created measure. 

Follow a similar process to add new cells for creating as many measures as you are able to (within the allotted time). 
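
If a measure produces one number per company, a minimal sketch like the following may help turn those numbers into an ordering string in the required format. The `scores` dictionary and its values are placeholders with no relation to the actual datasets, and the sketch assumes that a higher score means a more successful company.

```python
# Hypothetical scores, one per car company (placeholders, not real results).
scores = {1: 0.42, 2: 0.87, 3: 0.15, 4: 0.60}

# Sort company numbers from highest to lowest score, then join into a string.
ordering = "".join(str(company) for company in
                   sorted(scores, key=scores.get, reverse=True))

print(ordering)  # 2413
```

If a lower value indicates more success for your measure, drop `reverse=True`.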



In [None]:
#Template for designing a measure. 

#NOTE: Round all your statistics to 2 decimal places before reasoning with them!! 

#["1234", "1243", "1324", "1342", "1423", "1432", "2134", "2143", "2314", "2341", "2413", "2431", "3124", "3142", "3214", "3241", "3412", "3421", "4123", "4132", "4213", "4231", "4312", "4321"]
#Choose one string from the list above, replacing 'None' below, as the ordering of the car companies (most successful first)
carcompany_ordering_measure = 'None'
print(carcompany_ordering_measure)

#Write your reasoning here, replacing 'None' below
carcompany_ordering_reasoning_measure = 'None'
print(carcompany_ordering_reasoning_measure)

option_a = "I found the descriptive statistics HELPFUL in designing the measure (my measure is BASED ON one or more of them)"
option_b = "I found the descriptive statistics HELPFUL in designing the measure (my measure is NOT BASED ON them, but I still found them helpful in reasoning about what measures might or might not work for rank ordering the datasets)"
option_c = "I found the descriptive statistics NOT HELPFUL in designing the measure (my measure is NOT BASED ON them and I did not find information from them to be convincing enough to answer the task)"
option_d = "I found the descriptive statistics NOT HELPFUL in designing the measure (my measure is BASED ON them, but I did not find information from them to be convincing enough to answer the task)"
#Choose one of the four options above, replacing 'None' below
used_descriptive_statistics_in_reasoning_measure = 'None'
print(used_descriptive_statistics_in_reasoning_measure)

#Assign a value from 1 to 5 (1, 2, 3, 4 or 5) to denote your confidence in this created measure
confidence_measure = 1
print(confidence_measure)

In [None]:
#ONLY use this space below to write your code (if needed) for any measure you generate. DO NOT ERASE this code segment from the workbook.













#Your intuitive ideas are valuable and we don't want you to stumble in designing your measure because of forgetting minor syntax-related details!! 
#Therefore, in case you have an idea and need any help with syntax in implementing it, you can access the following documentation files (use the "Search" tab for queries) and/or summarized syntax sheets.

#a) Pandas library
#Documentation file: https://pandas.pydata.org/pandas-docs/stable/
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/fbc502d0-46b2-4e1b-b6b0-5402ff273251

#b) Numpy library
#Documentation file: https://docs.scipy.org/doc/numpy/user/index.html
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/e9f83f72-a81b-42c7-af44-4e35b48b20b7

#c) Matplotlib library
#Documentation file: https://matplotlib.org/contents.html
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/28b8210c-60cc-4f13-b0b4-5b4f2ad4790b

#d) Scipy library
#Documentation file: https://docs.scipy.org/doc/scipy/reference/
#Syntax sheet: https://datacamp-community-prod.s3.amazonaws.com/5710caa7-94d4-4248-be94-d23dea9e668f


# Instruction Phase

Please **complete the Problem Solving phase** before moving to this phase.

### Click the link to watch the first video (4 mins)
[Instruction Phase - Part 1](https://www.youtube.com/embed/zxh4JnVCwss?si=c7PN1ZqxDUHd6xzz)

### Click the link to watch the second video (11 mins)
[Instruction Phase - Part 2](https://www.youtube.com/embed/VzNpSVnSbhg?si=dsxKEvcHWxDvFIp-)

### Click the link to watch the third video (4 mins)
[Instruction Phase - Part 3](https://www.youtube.com/embed/bryiXWN3owI?si=B0wIgj7tyQ1yShY3)