# Regression Notebook

In this task, we will work on a multivariable regression problem.
The dataset is a record of diamond sale prices.

Remember that this is an artificial use case intended for **Contributor level** practice. Therefore, you should practice skills such as **Data Preprocessing, Data Visualisation, Preparing Data for ML**, and only after you receive feedback on your initial work is it recommended **to fit a baseline Machine Learning Model**.

````
Goal: find out which couple of factors contribute the most to the price of a diamond. Show a strong correlation between the factors that you come up with.
````
    
One thing to keep in mind is that the result that you get at the end is highly subjective. Try to have a good correlation score, be creative, and be ready to explain the reasoning behind your work.

## 1. Dataset Information
We have one `.csv` file inside the ``Data`` folder prepared for you.
    
Below is an explanation of our variables from the dataset taken directly from the [dataset source](https://www.kaggle.com/datasets/shivam2503/diamonds):

### Diamond attributes:
- `carat` weight of the diamond (0.2-5.01)
- `cut` quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- `color` diamond colour, from J (worst) to D (best)
- `clarity` a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- `x` length in mm (0-10.74)
- `y` width in mm (0-58.9)
- `z` depth in mm (0-31.8)
- `depth` total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
- `table` width of top of diamond relative to widest point (43-95)
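
The ranges above show that `x`, `y` and `z` can be 0, which cannot describe a real stone and would break the `depth` formula. The sketch below is one possible sanity check, not part of the original task: it uses a small hypothetical sample (not real rows from the dataset) and assumes that `depth` is stored as a percentage, which the listed range 43-79 suggests.

````python
import pandas as pd

# Hypothetical measurements in mm; the last row mimics a record
# whose dimensions were entered as zeros.
sample = pd.DataFrame({
    "x": [4.0, 6.0, 0.0],
    "y": [4.0, 6.2, 0.0],
    "z": [2.4, 3.8, 0.0],
})

# Flag rows where any dimension is zero.
has_zero_dim = (sample[["x", "y", "z"]] == 0).any(axis=1)

# depth (%) = 100 * 2 * z / (x + y), computed only for valid rows.
# The factor 100 is an assumption based on the 43-79 range.
valid = sample.loc[~has_zero_dim]
depth_pct = 100 * 2 * valid["z"] / (valid["x"] + valid["y"])

print(has_zero_dim.tolist())
print(depth_pct.round(2).tolist())
````

On the real dataset, the same mask can help you decide whether to drop or repair such rows before modelling.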

### Output variable (desired target):
- `price` price in US dollars ($326-$18,823)

## 2. Importing the Dataset
A dataset can be imported directly from a `.zip` file.
To import a dataset, you need to specify the file where the dataset is located.
The relative path below is correct relative to the location of this instruction file.
````python
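import zipfile

import pandas as pd

# Open the archive and read the CSV it contains straight into a DataFrame.
# The context managers close the archive and the file handle afterwards.
with zipfile.ZipFile('Data/diamonds.zip') as zf:
    with zf.open('diamonds.csv') as csv_file:
        df = pd.read_csv(csv_file)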
````
The archive and file names above are specific to our repository.

## 3. Task: Regression
The steps below serve as a **guide** to solving this problem. They are by no means mandatory or the only way to approach this particular dataset. Feel free to use what you have learned in the previous classrooms and to be creative. Try to find your own approach to this problem.


**Step 1: Data Loading & Describing**
- load the dataset into your Python Notebook
- convert the dataset to the desired format that you want to work with (dataframe, numpy.array, list, etc.)
- explore the dataset
- observe the variables carefully, and try to understand each variable and its meaning
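
A first look at the data could be sketched as follows. The tiny `df` built here is a hypothetical stand-in with the same column names as the dataset, so that the snippet runs on its own; in the notebook you would use the `df` loaded from `Data/diamonds.zip` instead.

````python
import pandas as pd

# Hypothetical stand-in rows, NOT real records from the dataset.
df = pd.DataFrame({
    "carat": [0.23, 0.31, 0.70, 1.01, 1.50],
    "cut": ["Ideal", "Good", "Premium", "Very Good", "Fair"],
    "color": ["E", "J", "G", "D", "H"],
    "clarity": ["SI2", "VS1", "VVS2", "SI1", "I1"],
    "depth": [61.5, 63.3, 60.8, 62.1, 64.9],
    "table": [55.0, 58.0, 59.0, 57.0, 60.0],
    "price": [326, 335, 2800, 5200, 7900],
    "x": [3.95, 4.34, 5.70, 6.40, 7.20],
    "y": [3.98, 4.35, 5.72, 6.43, 7.15],
    "z": [2.43, 2.75, 3.47, 3.98, 4.66],
})

print(df.shape)           # number of rows and columns
print(df.dtypes)          # which columns are numeric, which are text
print(df.describe())      # summary statistics of the numeric columns
print(df.isna().sum())    # missing values per column

# Non-numeric columns need encoding before most models can use them.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
for col in categorical_cols:
    print(df[col].value_counts())
````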

**Step 2: Data Visualisation & Exploration**
- employ various visualization techniques to understand the data more thoroughly
- with the visualization tools, understand what is happening in the dataset
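
Since the goal is about correlation with `price`, one way to prepare for plotting is to turn the quality grades into ordered numbers and compute a correlation matrix. The sketch below takes the grade orderings from the attribute list in Section 1; the helper name `encode_grades` and the `toy` rows are hypothetical and only there to make the snippet self-contained.

````python
import pandas as pd

# Orderings from worst to best, as given in the attribute list.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def encode_grades(frame):
    """Return a copy with cut/color/clarity replaced by ordinal codes (0 = worst)."""
    out = frame.copy()
    orders = [("cut", CUT_ORDER), ("color", COLOR_ORDER), ("clarity", CLARITY_ORDER)]
    for col, order in orders:
        out[col] = pd.Categorical(out[col], categories=order, ordered=True).codes
    return out


# Hypothetical rows, NOT real records from the dataset.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5],
    "cut": ["Fair", "Ideal", "Good", "Premium", "Very Good"],
    "color": ["J", "D", "G", "E", "H"],
    "clarity": ["I1", "IF", "VS2", "SI1", "VVS1"],
    "price": [400, 1200, 2500, 5000, 9000],
})

encoded = encode_grades(toy)

# Correlation of every feature with the target, strongest first.
price_corr = encoded.corr()["price"].drop("price").sort_values(ascending=False)
print(price_corr)
````

The full matrix `encoded.corr()` can then be drawn as a heatmap, for example with `seaborn.heatmap`, and individual pairs inspected with scatter plots.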

**Step 3: Data Modelling**
- separate variables & labels
- split the dataset into training & testing sets
- pick a data modelling approach and the Python modelling package that you would like to use
- fit the model to the training dataset
- output the model 
- make predictions on the testing dataset
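
The steps above could look roughly like this with scikit-learn. This is a minimal sketch, not the required solution: the data is synthetic (generated with a made-up linear relationship), and the choice of `LinearRegression`, the two features, and the 80/20 split are all assumptions for illustration.

````python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a made-up relationship, NOT the real dataset.
rng = np.random.default_rng(0)
n = 200
carat = rng.uniform(0.2, 2.0, n)
table = rng.uniform(50, 65, n)
price = 4000 * carat + 10 * table + rng.normal(0, 100, n)
data = pd.DataFrame({"carat": carat, "table": table, "price": price})

# Separate variables (features) and labels (target).
X = data[["carat", "table"]]
y = data["price"]

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# Output the model: one coefficient per feature plus the intercept.
print(dict(zip(X.columns, model.coef_)), model.intercept_)

# Predict prices for the unseen test rows.
y_pred = model.predict(X_test)
print(y_pred[:5])
````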

**Step 4: OPTIONAL Fine Tuning the Model For a Better Result**
- compare the predictions for the testing dataset against the actual values from your dataset
- make adjustments to your model for a better result (but make sure you don't overfit the model)
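
One common way to spot overfitting is to compare the score on the training set with the score on the testing set while you vary the model's complexity. The sketch below illustrates this on synthetic data with a decision tree; the model choice, the `max_depth` values and the data are assumptions for illustration, not a recommendation for this dataset.

````python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data, NOT the real dataset.
rng = np.random.default_rng(1)
X = pd.DataFrame({"carat": rng.uniform(0.2, 2.0, 300)})
y = 4000 * X["carat"] + rng.normal(0, 300, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# R^2 on train and test for increasingly flexible trees.
# max_depth=None lets the tree grow until it memorises the training data.
scores = {}
for depth in (2, 4, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])

# Side-by-side view of actual and predicted values for the last model.
comparison = pd.DataFrame({
    "actual": y_test.to_numpy(),
    "predicted": tree.predict(X_test),
})
comparison["error"] = comparison["predicted"] - comparison["actual"]
print(comparison.head())
````

A large gap between the training and testing score, as for the unrestricted tree here, is a sign that the model fits noise rather than structure.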
    
**Step 5: Result Extraction & Interpretation**
- draw your conclusions and interpret the model and final results
- evaluate the performance of your model and algorithm using different KPIs
- `bonus: ` use more visualization techniques to demonstrate the correlation between one or more variables and the price
- `bonus: ` as well as the difference between your prediction and the actual price
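
Typical KPIs for a regression model are the mean absolute error (MAE), the root mean squared error (RMSE) and the coefficient of determination (R²). The sketch below computes them with scikit-learn on a handful of hypothetical prices; in the notebook, `y_true` and `y_pred` would be your test labels and your model's predictions.

````python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices in US dollars.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0, 3800.0])

mae = mean_absolute_error(y_true, y_pred)           # average absolute miss
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalises large misses more
r2 = r2_score(y_true, y_pred)                       # share of variance explained

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
````

MAE and RMSE are in the same unit as `price`, which makes them easy to explain; R² is unit-free and convenient for comparing models.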

**Note!** Important criteria for evaluating your use case are well-documented cells, a well-structured notebook with headers that mark its various parts, and short comments on each part with the reflections and insights that you gained.

## 4. Additional Resources: 

**Packages that might be useful for you:**
- pandas: https://pandas.pydata.org/pandas-docs/stable/reference/index.html
- numpy: https://numpy.org/doc/
- scikit-learn: https://scikit-learn.org/stable/
- plotly: https://plotly.com/python-api-reference/
- lightGBM: https://lightgbm.readthedocs.io/en/latest/
- seaborn: https://seaborn.pydata.org/api.html

**Useful Links:**
- Die Pipeline: https://wiki.rbinternational.com/confluence/display/AAT/MGF+-+Die+Pipeline
- Scikit homepage: https://scikit-learn.org/stable/
- https://scikit-learn.org/
- https://seaborn.pydata.org/
- https://plotly.com/python/
- https://matplotlib.org/
- https://medium.com/pursuitnotes/multiple-linear-regression-model-in-7-steps-with-python-c6f40c0a527
- https://towardsdatascience.com/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-83a8f7ae2b4f

Dataset citation: https://www.kaggle.com/datasets/shivam2503/diamonds/code