# Project submission
**Due Friday May 16th before class.** Counts for 25% of the final course grade.

You should address all the questions relevant to your project.
You will not be graded on the values of the model performance, but on whether you have applied the right methodology:
- formulated the business problem
- translated it into the right machine learning approach
- analyzed your data and prepared it for modeling
- applied at least 5 different machine learning algorithms, as well as neural networks
- used cross validation for model tuning and justified your tuning metric
- set up a proper machine learning pipeline without data leakage
- evaluated your model using all the relevant metrics
- interpreted your model and justified all your decisions

If you have tried different approaches, please include them all, and not just the best one.
If doing some feature engineering has improved your model, also please include all of the steps, not just the most successful ones.

You should submit the notebook with the code, output and explanations. The notebook should be executable and comprehensible.

Points will be deducted for the following reasons:
- data leakage
- unjustified decisions (no discussion on: choice of metric for optimization, blind removal of features, blind removal of outliers...)
- notebook not comprehensible
- notebook with incomplete output
- notebook not executable
- blind copy-pasting from ChatGPT, if the copied code is not suitable for the task
- writing your own code (or copy-pasting it from an outside source) for simple functions that we covered and that already exist in `sklearn` (train test split, plain grid search, encoding of categorical variables, ...), as this leads to:
    - convoluted code prone to bugs
    - code that is hard to understand and review
    - a waste of the data scientist's time if ready-made simple functions exist
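
As a minimal sketch of the ready-made `sklearn` helpers named in the last point, the snippet below splits a dataset and encodes a categorical column. The toy DataFrame and its column names are made up for illustration and are not part of any project data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; replace with your own features and target.
toy = pd.DataFrame({
    "industry": ["retail", "tech", "retail", "food", "tech", "food", "retail", "tech"],
    "volume": [1.0, 2.5, 0.7, 1.2, 3.1, 0.9, 1.4, 2.2],
    "target": [0, 1, 0, 0, 1, 0, 1, 1],
})
X, y = toy.drop(columns="target"), toy["target"]

# Split first, so that nothing learned from the test rows leaks into preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Encode the categorical column; categories unseen during fit become all-zero rows.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["industry"])],
    remainder="passthrough",
)
X_train_enc = encoder.fit_transform(X_train)  # fit on the train set only
X_test_enc = encoder.transform(X_test)        # reuse the fitted encoder
```

Fitting the encoder on the train set and only calling `transform` on the test set is what keeps this step free of leakage.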

Additional points will be awarded for trying and testing different relevant approaches, from exploratory data analysis, to feature engineering, to modeling and evaluation.

There should be one submission per group, but team member evaluation can be submitted per person. If not submitted, the default is that all the team members have contributed equally to the project and should get the same grade.

### Group number:
### Student IDs:
### Project name:

## What business problem are you solving?
- Please state clearly what business problem you are solving. (one sentence)
- Elaborate on why this is a relevant problem and what you can do with the model output to create business value, i.e., how the model output is actionable. (2-3 paragraphs)

## What is the machine learning problem that you are solving?
- Please state clearly what the ML problem is.
- If applicable, state your target.

## Data exploration and preparation 

- How many data instances do you have?
- Do you have duplicates?
- How many features? What type are they?
- If they are categorical, what categories do they have, and what are their frequencies?
- If they are numerical, what is their distribution?
- Do you have outliers, and do you need to do anything about them?
- What is the distribution of the target variable?
- If you have a target, you can also check the relationship between the target and the variables.
- Do you have missing data? If yes, how are you going to handle it?
- Can you use the features in their original form, or do you need to alter them in some way?
- What have you learned about your data? Is there anything that can help you in feature engineering or modeling?
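
Most of the questions above map onto one or two `pandas` calls. The following is a rough sketch on a made-up toy frame (the column names and values are hypothetical); the 1.5 × IQR rule is just one common outlier heuristic, not a required method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the project data.
data = pd.DataFrame({
    "ticker": ["KO", "KO", "TGT", "TGT", "TGT", None],
    "close": [60.1, 60.1, 150.2, np.nan, 148.9, 33.0],
    "volume": [1_000, 1_000, 2_500, 2_700, 90_000, 800],
})

n_rows, n_cols = data.shape                    # instances and features
n_duplicates = int(data.duplicated().sum())    # fully identical rows
dtypes = data.dtypes                           # feature types
missing = data.isna().sum()                    # missing values per column
category_freq = data["ticker"].value_counts(dropna=False)  # category frequencies
numeric_summary = data.describe()              # distribution of numeric columns

# One common (but not the only) outlier rule: values beyond 1.5 * IQR.
q1, q3 = data["volume"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = data[(data["volume"] < q1 - 1.5 * iqr) | (data["volume"] > q3 + 1.5 * iqr)]
```

Whether a flagged row is an error or a genuine extreme value is a judgment that still has to be justified in the notebook.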


## Feature engineering
Creating good features is probably the most important step in the machine learning process. 
This might involve doing:
- transformations
- aggregating over data points or over time and space, or finding differences (for example: differences between two monthly bills, time difference between two contacts with the client) 
- creating dummy (binary) variables
- discretization

Business insight is very relevant in this process. If possible, you can also bring in additional relevant data.
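
The four kinds of features listed above could look roughly like this in `pandas`. The toy price table is hypothetical, and the feature names (`intraday_return`, `close_diff`, `prev2_mean_close`, `closed_up`, `price_band`) are invented for illustration.

```python
import pandas as pd

# Hypothetical toy prices for two tickers over three days.
prices = pd.DataFrame({
    "Ticker": ["KO", "KO", "KO", "TGT", "TGT", "TGT"],
    "Date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"] * 2),
    "Open": [50.0, 51.0, 52.0, 100.0, 99.0, 101.0],
    "Close": [51.0, 52.0, 51.5, 99.0, 101.0, 102.0],
})
prices = prices.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# Transformation: relative intraday move.
prices["intraday_return"] = (prices["Close"] - prices["Open"]) / prices["Open"]

# Difference over time, computed within each ticker.
prices["close_diff"] = prices.groupby("Ticker")["Close"].diff()

# Aggregation over time: mean of the previous two closes (shifted, so the
# current row's own Close is not used).
prices["prev2_mean_close"] = prices.groupby("Ticker")["Close"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)

# Dummy (binary) variable.
prices["closed_up"] = (prices["Close"] > prices["Open"]).astype(int)

# Discretization into labelled bins.
prices["price_band"] = pd.cut(
    prices["Close"], bins=[0, 75, float("inf")], labels=["low", "high"]
)
```

Shifting before aggregating matters whenever the aggregated column is related to the target, since a feature that includes the current row's value can leak it.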

## Modeling
You should implement AT LEAST FIVE approaches we covered, and tune at least two hyperparameters of each approach.
Do not forget that you should split your data.
You should do model selection and tuning using cross validation on the train set, avoiding data leakage.
Explain and justify the metric you are using for model selection and tuning. If your data is imbalanced, consider using techniques for data balancing.
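
One possible shape for such a tuning step is sketched below on synthetic data. The choice of logistic regression, the grid values, and the `f1` scoring are assumptions made for the example; your own model, grid, and metric need their own justification.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data standing in for a real project dataset.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler lives inside the pipeline, so it is re-fitted on the training
# part of every fold and never sees the validation part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Two hyperparameters are tuned: regularization strength and class weighting.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1",  # assumed metric; justify your own choice
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)  # the test set is not touched here

print(search.best_params_, round(search.best_score_, 3))
```

Here `class_weight="balanced"` is used as a simple stand-in for a balancing technique; resampling methods are another option.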

Separately, you should train a neural network. Visualize the training and validation loss. Discuss the network performance.
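
A loss curve of this kind can be sketched with `sklearn`'s `MLPClassifier`, used here only as an assumption since the brief does not prescribe a framework; the same idea carries over to Keras or PyTorch. The data is synthetic and the network size is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training part only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s = scaler.transform(X_train), scaler.transform(X_val)

net = MLPClassifier(hidden_layer_sizes=(16,), random_state=0)
classes = np.unique(y_train)
train_loss, val_loss = [], []

# Each partial_fit call performs one pass over the training data,
# after which the log loss is recorded on both splits.
for epoch in range(30):
    net.partial_fit(X_train_s, y_train, classes=classes)
    train_loss.append(log_loss(y_train, net.predict_proba(X_train_s)))
    val_loss.append(log_loss(y_val, net.predict_proba(X_val_s)))

plt.plot(train_loss, label="training loss")
plt.plot(val_loss, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("log loss")
plt.legend()
plt.show()
```

A validation curve that flattens or rises while the training curve keeps falling is the usual sign of overfitting and is worth discussing.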

In model selection, make sure when you compare different models and approaches that you compare them on the same dataset, though different transformations could be applied to the comparison dataset.

## Model evaluation

After selecting your final model, which could be a compromise between performance, interpretability and complexity, you should evaluate its performance on the test set. 
You might have tuned your model using a certain metric, but now you should describe the model performance using all relevant metrics. 
If you have some business insight into why a certain metric is relevant, you should explain it. 
Construct a suitable baseline to benchmark your results and put them in context.
Discuss your results: do they seem good enough to be used in practice? If not, what should be improved? Discuss what types of errors your model is making.
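
As an illustration of benchmarking against a baseline, the sketch below compares a majority-class `DummyClassifier` with a fitted model on synthetic data. The model and the metrics shown are assumptions for the example; a regression project would use a different baseline (for instance `DummyRegressor`) and different metrics.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, estimator in [("baseline", baseline), ("model", model)]:
    y_pred = estimator.predict(X_test)
    print(name, "F1:", round(f1_score(y_test, y_pred, zero_division=0), 3))

# Several metrics at once, plus the error types (false positives/negatives).
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
print(confusion_matrix(y_test, y_pred))
```

The confusion matrix is a natural starting point for the discussion of which types of errors the model makes.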


## Model interpretation

Use at least two different techniques for model interpretability. Discuss which features of your model are the most important, and how they impact the model performance. Pick a few examples of errors that your model is making, and check which features lead to these errors.
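
Two techniques that could be combined are sketched below: impurity-based importances from a random forest and permutation importance on held-out data. The model, the synthetic data, and the feature names are assumptions for the example; other techniques covered in the course are equally valid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Technique 1: impurity-based importances stored on the fitted forest.
impurity_rank = np.argsort(model.feature_importances_)[::-1]

# Technique 2: permutation importance, measured on held-out data.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("impurity:   ", [feature_names[i] for i in impurity_rank])
print("permutation:", [feature_names[i] for i in perm_rank])

# Rows the model gets wrong, to be inspected feature by feature.
y_pred = model.predict(X_test)
error_idx = np.flatnonzero(y_pred != y_test)
print("misclassified test rows:", error_idx[:5])
```

If the two rankings disagree, that disagreement is itself worth a sentence in the discussion.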

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_colwidth', None)

In [87]:
df = pd.read_csv('pr13_stocks (1).csv', index_col=0)
df.head()

Unnamed: 0,Date,Dividends,Stock Splits,Brand_Name,Ticker,Industry_Tag,Country,Volume,Open,High,Low,Close
0,2021-01-25 00:00:00-05:00,0.0,0.0,crocs,CROX,footwear,usa,1102500.0,73.18,74.75,71.050003,73.910004
1,2019-09-12 00:00:00-04:00,0.0,0.0,target,TGT,retail,usa,3185700.0,100.810594,101.077753,99.999911,100.359192
2,2015-12-29 00:00:00-05:00,0.0,0.0,unilever,UL,consumer goods,netherlands,1278700.0,33.785507,33.970634,33.700657,33.908924
3,2014-06-13 00:00:00-04:00,0.0,0.0,amd,AMD,technology,usa,17734600.0,4.36,4.39,4.24,4.28
4,2017-10-06 00:00:00-04:00,0.0,0.0,the walt disney company,DIS,entertainment,usa,4360200.0,96.426167,96.734885,95.837672,96.541939


In [88]:
df['Date'] = pd.to_datetime(df['Date'], utc=True)

df

Unnamed: 0,Date,Dividends,Stock Splits,Brand_Name,Ticker,Industry_Tag,Country,Volume,Open,High,Low,Close
0,2021-01-25 05:00:00+00:00,0.0,0.0,crocs,CROX,footwear,usa,1102500.0,73.180000,74.750000,71.050003,73.910004
1,2019-09-12 04:00:00+00:00,0.0,0.0,target,TGT,retail,usa,3185700.0,100.810594,101.077753,99.999911,100.359192
2,2015-12-29 05:00:00+00:00,0.0,0.0,unilever,UL,consumer goods,netherlands,1278700.0,33.785507,33.970634,33.700657,33.908924
3,2014-06-13 04:00:00+00:00,0.0,0.0,amd,AMD,technology,usa,17734600.0,4.360000,4.390000,4.240000,4.280000
4,2017-10-06 04:00:00+00:00,0.0,0.0,the walt disney company,DIS,entertainment,usa,4360200.0,96.426167,96.734885,95.837672,96.541939
...,...,...,...,...,...,...,...,...,...,...,...,...
99995,2004-06-18 04:00:00+00:00,0.0,0.0,fedex,FDX,logistics,usa,957800.0,66.332358,66.936849,65.991798,66.604805
99996,2022-06-24 04:00:00+00:00,0.0,0.0,hershey company,HSY,food & beverage,usa,1154300.0,213.073770,216.005978,212.311397,215.966888
99997,2017-09-06 04:00:00+00:00,0.0,0.0,unilever,UL,consumer goods,netherlands,1079900.0,47.701035,47.904677,47.546265,47.880238
99998,2021-06-04 04:00:00+00:00,0.0,0.0,amazon,AMZN,e-commerce,usa,44994000.0,160.600006,161.050003,159.940506,160.311005
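
The raw `Date` strings carry two different UTC offsets (`-05:00` and `-04:00`), which is why `utc=True` is needed to get a single datetime dtype. As a small sketch, assuming these offsets correspond to US Eastern time (this is not stated in the data), the original calendar date could be recovered like this:

```python
import pandas as pd

# Two timestamps copied from the raw data, with different UTC offsets.
raw = pd.Series(["2021-01-25 00:00:00-05:00", "2019-09-12 00:00:00-04:00"])

# utc=True converts both to UTC, giving a single timezone-aware dtype.
as_utc = pd.to_datetime(raw, utc=True)

# Assumption: the offsets correspond to US Eastern time, so converting back
# recovers local midnight, from which the calendar date can be extracted.
local = as_utc.dt.tz_convert("America/New_York")
trading_day = local.dt.date
```

Working with the calendar date avoids the 04:00 versus 05:00 artifact that appears in the UTC column across daylight saving changes.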


In [89]:
df.isna().sum()

Date             16
Dividends        59
Stock Splits    458
Brand_Name      225
Ticker          175
Industry_Tag     34
Country         100
Volume          852
Open            803
High            664
Low             681
Close           194
dtype: int64

In [90]:
# Drop the 16 rows with a missing Date
df = df.dropna(subset=['Date'])

In [91]:
df

Unnamed: 0,Date,Dividends,Stock Splits,Brand_Name,Ticker,Industry_Tag,Country,Volume,Open,High,Low,Close
0,2021-01-25 05:00:00+00:00,0.0,0.0,crocs,CROX,footwear,usa,1102500.0,73.180000,74.750000,71.050003,73.910004
1,2019-09-12 04:00:00+00:00,0.0,0.0,target,TGT,retail,usa,3185700.0,100.810594,101.077753,99.999911,100.359192
2,2015-12-29 05:00:00+00:00,0.0,0.0,unilever,UL,consumer goods,netherlands,1278700.0,33.785507,33.970634,33.700657,33.908924
3,2014-06-13 04:00:00+00:00,0.0,0.0,amd,AMD,technology,usa,17734600.0,4.360000,4.390000,4.240000,4.280000
4,2017-10-06 04:00:00+00:00,0.0,0.0,the walt disney company,DIS,entertainment,usa,4360200.0,96.426167,96.734885,95.837672,96.541939
...,...,...,...,...,...,...,...,...,...,...,...,...
99995,2004-06-18 04:00:00+00:00,0.0,0.0,fedex,FDX,logistics,usa,957800.0,66.332358,66.936849,65.991798,66.604805
99996,2022-06-24 04:00:00+00:00,0.0,0.0,hershey company,HSY,food & beverage,usa,1154300.0,213.073770,216.005978,212.311397,215.966888
99997,2017-09-06 04:00:00+00:00,0.0,0.0,unilever,UL,consumer goods,netherlands,1079900.0,47.701035,47.904677,47.546265,47.880238
99998,2021-06-04 04:00:00+00:00,0.0,0.0,amazon,AMZN,e-commerce,usa,44994000.0,160.600006,161.050003,159.940506,160.311005


In [92]:
df[df['Brand_Name'].isna() & df['Ticker'].isna()]

Unnamed: 0,Date,Dividends,Stock Splits,Brand_Name,Ticker,Industry_Tag,Country,Volume,Open,High,Low,Close
29955,2000-11-16 05:00:00+00:00,0.0,0.0,,,apparel,usa,17245350.0,4.804951,5.124243,4.804951,5.038579


In [93]:
# Drop the row missing both Brand_Name and Ticker, as we cannot tell which company it refers to
df = df[~(df['Brand_Name'].isna() & df['Ticker'].isna())].copy()

In [94]:
df

Unnamed: 0,Date,Dividends,Stock Splits,Brand_Name,Ticker,Industry_Tag,Country,Volume,Open,High,Low,Close
0,2021-01-25 05:00:00+00:00,0.0,0.0,crocs,CROX,footwear,usa,1102500.0,73.180000,74.750000,71.050003,73.910004
1,2019-09-12 04:00:00+00:00,0.0,0.0,target,TGT,retail,usa,3185700.0,100.810594,101.077753,99.999911,100.359192
2,2015-12-29 05:00:00+00:00,0.0,0.0,unilever,UL,consumer goods,netherlands,1278700.0,33.785507,33.970634,33.700657,33.908924
3,2014-06-13 04:00:00+00:00,0.0,0.0,amd,AMD,technology,usa,17734600.0,4.360000,4.390000,4.240000,4.280000
4,2017-10-06 04:00:00+00:00,0.0,0.0,the walt disney company,DIS,entertainment,usa,4360200.0,96.426167,96.734885,95.837672,96.541939
...,...,...,...,...,...,...,...,...,...,...,...,...
99995,2004-06-18 04:00:00+00:00,0.0,0.0,fedex,FDX,logistics,usa,957800.0,66.332358,66.936849,65.991798,66.604805
99996,2022-06-24 04:00:00+00:00,0.0,0.0,hershey company,HSY,food & beverage,usa,1154300.0,213.073770,216.005978,212.311397,215.966888
99997,2017-09-06 04:00:00+00:00,0.0,0.0,unilever,UL,consumer goods,netherlands,1079900.0,47.701035,47.904677,47.546265,47.880238
99998,2021-06-04 04:00:00+00:00,0.0,0.0,amazon,AMZN,e-commerce,usa,44994000.0,160.600006,161.050003,159.940506,160.311005


In [95]:
df['Brand_Name'] = df['Brand_Name'].str.lower()
df['Ticker'] = df['Ticker'].str.upper()

# Build lookup tables from rows where both identifiers are present
known = df.dropna(subset=['Brand_Name', 'Ticker'])
brand_to_ticker = (
    known.drop_duplicates(subset=['Brand_Name'])
         .set_index('Brand_Name')['Ticker']
         .to_dict()
)
ticker_to_brand = (
    known.drop_duplicates(subset=['Ticker'])
         .set_index('Ticker')['Brand_Name']
         .to_dict()
)

# Fill missing Ticker using Brand_Name
missing_ticker = df['Ticker'].isna() & df['Brand_Name'].notna()
df.loc[missing_ticker, 'Ticker'] = df.loc[missing_ticker, 'Brand_Name'].map(brand_to_ticker)

# Fill missing Brand_Name using Ticker
missing_brand = df['Brand_Name'].isna() & df['Ticker'].notna()
df.loc[missing_brand, 'Brand_Name'] = df.loc[missing_brand, 'Ticker'].map(ticker_to_brand)

In [96]:
df.isna().sum() 

Date              0
Dividends        59
Stock Splits    458
Brand_Name        0
Ticker            0
Industry_Tag     34
Country         100
Volume          852
Open            803
High            664
Low             681
Close           194
dtype: int64
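
Filling one identifier from the other relies on each brand mapping to exactly one ticker and vice versa. A possible sanity check for that assumption is sketched below on a hypothetical toy frame with the same two columns; the same `groupby` calls could be run on `df` before the fill.

```python
import pandas as pd

# Hypothetical toy frame with the same two columns as `df`.
toy = pd.DataFrame({
    "Brand_Name": ["crocs", "crocs", None, "target", "target", "amd"],
    "Ticker": ["CROX", None, "CROX", "TGT", "TGT", None],
})

# Count distinct tickers per brand and distinct brands per ticker.
# A count above 1 would mean the fill depends on which row comes first.
tickers_per_brand = toy.groupby("Brand_Name")["Ticker"].nunique()
brands_per_ticker = toy.groupby("Ticker")["Brand_Name"].nunique()

ambiguous_brands = tickers_per_brand[tickers_per_brand > 1]
ambiguous_tickers = brands_per_ticker[brands_per_ticker > 1]

# Brands that never appear with a ticker cannot be filled by the mapping.
unfillable_brands = tickers_per_brand[tickers_per_brand == 0]
```

Empty `ambiguous_brands` and `ambiguous_tickers` would support the one-to-one assumption; any entries in `unfillable_brands` would remain missing after the fill.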

In [97]:
print(df['Ticker'].value_counts())

Ticker
MCD     2195
KO      2178
TGT     2175
COST    2167
MAR     2161
        ... 
PTON     376
ZI       300
ABNB     246
RBLX     234
COIN     220
Name: count, Length: 61, dtype: int64
