## Test Instructions

This Jupyter Notebook is a guide to help you answer the questions available on the Moodle platform of Universidade Lusófona.

Submit this file on the platform in the designated field once you have completed the test.

### Test Rules:

- Each question evaluated on Moodle is scored between 0 and 1.
- One or more alternatives may be correct.
- Each incorrect alternative selected cancels out one correct alternative.
- The score for a question never drops below 0, even if only incorrect alternatives are selected.
- Do not forget to submit the completed Jupyter Notebook file.
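
The scoring rules above can be read as a small formula. The sketch below is one possible interpretation, not the platform's confirmed algorithm: it assumes every correct alternative is worth an equal share of the question, every incorrect pick subtracts the same share, and the result is floored at 0. The function name `question_score` is hypothetical.

```python
def question_score(selected, correct):
    """Score one multiple-choice question on a 0-to-1 scale.

    Assumed formula (not stated in the rules): each correct pick adds
    1/len(correct), each incorrect pick subtracts the same amount,
    and the result is floored at 0.
    """
    selected, correct = set(selected), set(correct)
    hits = len(selected & correct)      # correct alternatives that were selected
    misses = len(selected - correct)    # incorrect alternatives that were selected
    return max(0.0, (hits - misses) / len(correct))


print(question_score({"A", "C"}, {"A", "C"}))  # both correct alternatives selected
print(question_score({"A", "B"}, {"A", "C"}))  # one correct, one incorrect
print(question_score({"B", "D"}, {"A", "C"}))  # only incorrect alternatives
```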

### Code of Conduct:

- **Not allowed:**
    - consulting class materials.
    - consulting other classmates.
    - using electronic devices such as smartphones, smartwatches, etc.
    - using instant messaging applications during the test.

- **Allowed:**
    - using generative AI.


In [1]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

---

## Part 1: Corporate Dataset

In [2]:
# Load the dataset
df = pd.read_csv("hf://datasets/Mursalin295/Ecommerce_Product_Reviews/Ecommerce_Product_Reviews.csv")



In [3]:
df.head()

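|   | employee_id | hire_date | age | department | job_level | employment_type | monthly_salary_usd | performance_score | remote_status | years_at_company |
|---|-------------|-----------|-----|------------|-----------|-----------------|--------------------|-------------------|---------------|------------------|
| 0 | EMP20001 | 2018-03-12 | 29 | Engineering | Mid | FullTime | 4200 | 4.2 | Hybrid | 6 |
| 1 | EMP20002 | 2020-07-01 | 34 | Marketing | Senior | FullTime | 5100 | 4.5 | Remote | 4 |
| 2 | EMP20003 | 2016-11-20 | 41 | Finance | Lead | FullTime | 6800 | 4.7 | Onsite | 8 |
| 3 | EMP20004 | 2022-01-10 | 26 | Design | Junior | Contract | 2800 | 3.9 | Remote | 2 |
| 4 | EMP20005 | 2019-05-18 | 37 | Engineering | Senior | FullTime | 5900 | 4.6 | Hybrid | 5 |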


| Column              | Description                                                              |
|---------------------|--------------------------------------------------------------------------|
| monthly_salary_usd  | Employee’s monthly salary in U.S. dollars (USD).                         |
| employee_id         | Unique identifier of the employee.                                       |
| hire_date           | Date the employee was hired by the company.                              |
| age                 | Employee’s age (in years).                                               |
| department          | Department where the employee works (e.g., Engineering, Marketing).      |
| job_level           | Hierarchical job level (Junior, Mid, Senior, Lead, Director).            |
| employment_type     | Type of employment contract (FullTime, Contract, etc.).                  |
| performance_score   | Employee performance evaluation score (continuous scale).                |
| remote_status       | Work arrangement (Onsite, Hybrid, or Remote).                            |
| years_at_company    | Total time the employee has been with the company, in years.             |


### Q1.1 – Is there a relationship between monthly salary and years at the company? Which other numerical variables also show correlation with monthly salary? Build simple regression models between them to justify your answer.


### Q1.2 – Which department pays more on average than the others?
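
A group-wise mean is probably the most direct way to compare departments. The sketch below uses only the five rows shown in the `df.head()` preview above, so its result illustrates the technique and is not the answer for the full dataset; the `demo` frame is a stand-in for `df`.

```python
import pandas as pd

# The five preview rows from df.head(); replace `demo` with the full `df`.
demo = pd.DataFrame({
    "department": ["Engineering", "Marketing", "Finance", "Design", "Engineering"],
    "monthly_salary_usd": [4200, 5100, 6800, 2800, 5900],
})

# Mean salary per department, highest first.
avg_salary = (
    demo.groupby("department")["monthly_salary_usd"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_salary)
print("Highest average in this sample:", avg_salary.idxmax())
```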

### Q1.3 – Study the relationship between the length of time an employee has been at the company and their job level. Explain, using a linear regression, which coefficient relates job levels to years at the company.

### Q1.4 – What is the best regression model for the dataframe above to explain monthly salary? Start with all variables and progressively remove those with the highest p-values until all remaining variables are statistically significant (p-value < 0.05). Present the summary of each model and explain the variables that remained in the final model.


## Part 2: Marketing

In [4]:
# Load the dataframe from marketing_data_fake.csv
df = pd.read_csv("marketing_data_fake.csv")
df

|   | User_ID | Name | Age | City | Membership | Device | Total_Spend | Is_Clicked |
|---|---------|------|-----|------|------------|--------|-------------|------------|
| 0 | 4d5463a9-5b77-4b54-a66d-22f161c85d29 | Eileen Monroe | 39 | Charleston | Gold | Android | 10348 | 1 |
| 1 | 2b7119d1-7a01-404a-9e7f-20a4f281ba33 | Clayton Alexander | 33 | North Scottstad | Basic | Android | 1040 | 0 |
| 2 | f8794d24-27eb-4c00-ba1f-69fac37bb5a0 | Mary Castaneda | 30 | Ericashire | Basic | iOS | 921 | 0 |
| 3 | e604d5df-35af-405b-8cb6-4b01099958c8 | Richard Hernandez | 40 | West Jill | Silver | iOS | 5223 | 1 |
| 4 | 540ae2ce-e85c-45e0-85fa-d1e787986621 | James Hill | 25 | Micheleview | Silver | Android | 4399 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | d57c15e2-8c1b-47f4-a337-f17f62e07a8c | Cynthia Lopez | 18 | Judystad | Gold | iOS | 9600 | 1 |
| 496 | 4b680e6d-1cc2-4eb8-ba2d-3d0c1f085ebc | John Stafford | 31 | South Julia | Gold | iOS | 10784 | 0 |
| 497 | 163eec3f-9e72-4b75-b225-8a40b2afce32 | Ryan Friedman | 36 | Lake Josephburgh | Basic | iOS | 944 | 1 |
| 498 | 5150afd5-db2b-42b4-af10-d9d392d9c0be | Laura Lewis | 30 | Hillton | Gold | Android | 11682 | 0 |


### Q2.1 – Is there a relationship between the variable `City` and the variable `Is_Clicked`?
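
Both variables are categorical, so a chi-square test of independence on their contingency table is one reasonable tool here. The sketch below uses synthetic data with only three cities (hypothetical; replace `demo` with `df`). If the real `City` column has nearly one distinct value per row, as the preview suggests, the expected counts become very small and the test is no longer reliable, which is why the sketch also prints the number of cities and the smallest expected count.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in data (hypothetical; replace with the real `df`).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "City": rng.choice(["Charleston", "Hillton", "Judystad"], size=300),
    "Is_Clicked": rng.integers(0, 2, size=300),
})

# Contingency table: rows are cities, columns are click outcomes.
table = pd.crosstab(demo["City"], demo["Is_Clicked"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.4f}")
# Diagnostics: many distinct cities or tiny expected counts undermine the test.
print("distinct cities:", demo["City"].nunique())
print("smallest expected count:", round(float(expected.min()), 2))
```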


### Q2.2 – Regarding the other variables, are there significant correlations with the variable `Is_Clicked`? Build a logistic regression model to justify your answer. Use `test_size=0.2` and `random_state=42`.


### Q2.3 – Create separate logistic regression models for each type of **Membership**. Use `test_size=0.2` and `random_state=42`. What differences do you find among the models?
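
One way to compare membership types is to fit the same model on each subset and collect the coefficients and accuracies side by side. The sketch below assumes scikit-learn's `LogisticRegression` with standardized features, so the coefficients are per standard deviation of each predictor. The data are synthetic with no built-in signal (hypothetical; replace `demo` with `df`).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; replace with the real `df`).
rng = np.random.default_rng(42)
n = 600
demo = pd.DataFrame({
    "Age": rng.integers(18, 60, size=n),
    "Membership": rng.choice(["Basic", "Silver", "Gold"], size=n),
    "Device": rng.choice(["Android", "iOS"], size=n),
    "Total_Spend": rng.integers(500, 12000, size=n),
    "Is_Clicked": rng.integers(0, 2, size=n),
})

rows = []
for level, group in demo.groupby("Membership"):
    X = pd.get_dummies(group[["Age", "Device", "Total_Spend"]], drop_first=True, dtype=float)
    y = group["Is_Clicked"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
    # Coefficients of the final pipeline step, keyed by feature name.
    coefs = dict(zip(X.columns, clf[-1].coef_[0].round(3)))
    rows.append({
        "Membership": level,
        "n": len(group),
        "accuracy": accuracy_score(y_test, clf.predict(X_test)),
        **coefs,
    })

comparison = pd.DataFrame(rows).set_index("Membership")
print(comparison)
```

Differences in the sign or size of a coefficient across rows of `comparison` are the kind of contrast the question asks about.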

## Part 3: Ecommerce

In [6]:
# Read ecommerce_cargo_dataset.csv
df = pd.read_csv("ecommerce_cargo_dataset.csv")
df

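|   | Cargo_ID | Order_ID | Length_cm | Width_cm | Height_cm | Weight_kg | Category | Fragile | Destination_Zone | Priority | Stackable | Sorting_Sequence | Robotic_Pick_Position_X | Robotic_Pick_Position_Y | Robotic_Pick_Position_Z |
|---|----------|----------|-----------|----------|-----------|-----------|----------|---------|------------------|----------|-----------|------------------|-------------------------|-------------------------|-------------------------|
| 0 | CARGO_0001 | ORDER_355 | 12.9 | 33.0 | 6.2 | 4.28 | Electronics | True | Zone_D | 3 | False | 32 | 46.0 | 30.5 | 4.0 |
| 1 | CARGO_0002 | ORDER_349 | 11.6 | 23.4 | 4.2 | 2.44 | Books | False | Zone_A | 5 | True | 89 | 4.9 | 25.1 | 1.4 |
| 2 | CARGO_0003 | ORDER_1116 | 7.4 | 10.4 | 5.4 | 0.81 | Clothing | False | Zone_A | 3 | True | 40 | 20.2 | 24.0 | 2.2 |
| 3 | CARGO_0004 | ORDER_434 | 19.8 | 12.0 | 8.3 | 4.38 | Beauty | True | Zone_A | 3 | False | 19 | 19.6 | 31.4 | 5.3 |
| 4 | CARGO_0005 | ORDER_135 | 19.1 | 20.1 | 3.5 | 2.39 | Books | False | Zone_D | 2 | True | 16 | 36.5 | 15.7 | 4.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3495 | CARGO_3496 | ORDER_771 | 26.2 | 29.0 | 11.5 | 15.72 | Electronics | True | Zone_C | 5 | False | 3 | 13.8 | 41.8 | 1.9 |
| 3496 | CARGO_3497 | ORDER_958 | 21.7 | 37.0 | 19.1 | 34.29 | Electronics | True | Zone_B | 1 | False | 29 | 5.1 | 43.7 | 6.9 |
| 3497 | CARGO_3498 | ORDER_403 | 29.5 | 20.4 | 16.9 | 20.20 | Toys | False | Zone_D | 4 | True | 76 | 39.8 | 37.4 | 5.0 |
| 3498 | CARGO_3499 | ORDER_1195 | 36.8 | 7.5 | 2.7 | 1.45 | Sports | False | Zone_B | 1 | True | 83 | 35.1 | 9.5 | 7.0 |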


| Variable | Description |
|---------|-------------|
| Cargo_ID | A unique identifier for each cargo item in the dataset. |
| Order_ID | Identifier for the order to which the cargo belongs; multiple cargo items can share the same order. |
| Length_cm | The length of the cargo item measured in centimeters. |
| Width_cm | The width of the cargo item measured in centimeters. |
| Height_cm | The height of the cargo item measured in centimeters. |
| Weight_kg | The weight of the cargo item in kilograms. |
| Category | The type of cargo, e.g., Electronics, Clothing, Books, Toys, etc. |
| Fragile | Boolean value indicating whether the cargo item is fragile and requires careful handling. |
| Destination_Zone | Warehouse zone or shipping area where the cargo is designated to be sent. |
| Priority | Priority level for delivery or processing, usually on a scale from 1 (low) to 5 (high). |
| Stackable | Boolean indicating whether the cargo can be safely stacked on other items. |
| Sorting_Sequence | The optimal sequence number for sorting the cargo item in automated handling systems. |
| Robotic_Pick_Position_X | X-coordinate position for robotic handling or placement in the warehouse. |
| Robotic_Pick_Position_Y | Y-coordinate position for robotic handling or placement in the warehouse. |
| Robotic_Pick_Position_Z | Z-coordinate (height) position for robotic handling or placement in the warehouse. |


### Q3.1 – Is it possible, with the loaded data, to create a logistic regression model that predicts shipping priorities (`Priority`)? If necessary, apply one-hot encoding to categorical variables. Create a logistic regression model for each priority level (1 to 5). Use `test_size=0.2` and `random_state=42`. What is the accuracy of each model? Present the confusion matrix. Also consider a classification threshold of 0.4.
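
A one-model-per-level setup (one-vs-rest) can be sketched as follows: for each priority level, build a binary target, fit a logistic regression, and turn predicted probabilities into classes with the 0.4 threshold. The data are synthetic with no built-in signal, and only a few of the real columns are mimicked (hypothetical; replace `demo` with `df` and drop identifier columns such as `Cargo_ID` and `Order_ID`).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; replace with the real `df`).
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    "Weight_kg": rng.uniform(0.5, 35.0, size=n).round(2),
    "Category": rng.choice(["Electronics", "Books", "Clothing"], size=n),
    "Fragile": rng.choice([True, False], size=n),
    "Destination_Zone": rng.choice(["Zone_A", "Zone_B", "Zone_C", "Zone_D"], size=n),
    "Priority": rng.integers(1, 6, size=n),
})

# One-hot encode the categorical predictors; booleans become 0.0/1.0.
X = pd.get_dummies(demo.drop(columns="Priority"), drop_first=True).astype(float)

THRESHOLD = 0.4
results = {}
for level in range(1, 6):
    y = (demo["Priority"] == level).astype(int)  # 1 if this level, else 0
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]       # P(Priority == level)
    pred = (proba >= THRESHOLD).astype(int)       # custom threshold instead of 0.5
    results[level] = {
        "accuracy": accuracy_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred, labels=[0, 1]),
    }
    print(f"Priority {level}: accuracy={results[level]['accuracy']:.3f}")
    print(results[level]["confusion_matrix"])
```

When a level covers only a small share of the rows, a model that never predicts it still reaches a high accuracy, so the confusion matrix is needed to tell whether the model detects that level at all.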

### Q3.2 – Reduce the variable `Priority` to a binary variable, where 0 represents low priorities (1, 2, 3) and 1 represents high priorities (4, 5). Create a logistic regression model for this new binary variable. Use `test_size=0.2` and `random_state=42`. What is the model’s accuracy? Present the confusion matrix.
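
The binary recoding and the single model can be sketched as below. The column name `High_Priority` is hypothetical, the data are synthetic with no built-in signal, and the original `Priority` column is removed from the predictors because keeping it would leak the target into the model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; replace with the real `df`).
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    "Weight_kg": rng.uniform(0.5, 35.0, size=n).round(2),
    "Category": rng.choice(["Electronics", "Books", "Toys"], size=n),
    "Stackable": rng.choice([True, False], size=n),
    "Priority": rng.integers(1, 6, size=n),
})

# 0 = low priority (1, 2, 3); 1 = high priority (4, 5).
demo["High_Priority"] = (demo["Priority"] >= 4).astype(int)

# Exclude both the original and the recoded target from the predictors.
X = pd.get_dummies(
    demo.drop(columns=["Priority", "High_Priority"]), drop_first=True
).astype(float)
y = demo["High_Priority"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

pred = clf.predict(X_test)  # default 0.5 threshold
accuracy = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred, labels=[0, 1])
print(f"accuracy: {accuracy:.3f}")
print(cm)  # rows: actual 0/1, columns: predicted 0/1
```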