**This notebook is an exercise in the [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/data-leakage).**

---


# Data Leakage

- A major problem when developing machine learning prediction models
- It means that information from outside the training data is used to build the model

# Introduction

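- Data leakage happens when the training data contains information about the target, but similar data will not be available when the model is used for prediction.
- This leads to high performance on the training set (and possibly even the validation data), but the model will perform poorly in production.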

- There are two main types of leakage: target leakage and train-test contamination. 

#### 1) Target Leakage
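- Occurs when the predictors include data that will not be available at the time you make predictions.
- It is important to think in terms of the timing or chronological order in which data becomes available, not merely whether a feature helps make good predictions.

![image.png](attachment:image.png)

- In the example, people take antibiotics after getting pneumonia, so the `took_antibiotic_medicine` feature should not be used.
- The validation data is held out from the training data and comes from the same source, so the model will get good validation scores, but it will be very inaccurate when deployed in the real world.
- This is because even patients who will get pneumonia won't have received antibiotics yet at the moment we need to predict their future health.
- So exclude any variable that is updated (or created) after the target value is realized.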
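As a minimal sketch of the pneumonia example above, the snippet below checks how closely a suspicious feature tracks the target and then drops it before modeling. The table values and the `age` and `weight_kg` columns are made up for illustration; only `took_antibiotic_medicine` and the pneumonia target come from the notes above.

```python
import pandas as pd

# Hypothetical toy data mirroring the pneumonia example; all values are made up.
patients = pd.DataFrame({
    "age": [34, 61, 47, 72, 29, 55],
    "weight_kg": [70, 82, 65, 90, 58, 77],
    "took_antibiotic_medicine": [False, True, False, True, False, True],
    "got_pneumonia": [False, True, False, True, False, True],
})

# Fraction of rows where the feature simply repeats the target.
agreement = (patients["took_antibiotic_medicine"] == patients["got_pneumonia"]).mean()
print(f"agreement with target: {agreement:.2f}")

# The antibiotic column is recorded after the diagnosis, so it is dropped
# together with the target column when building the feature matrix.
leaky_columns = ["took_antibiotic_medicine"]
X = patients.drop(columns=leaky_columns + ["got_pneumonia"])
y = patients["got_pneumonia"]
print(list(X.columns))
```

A near-perfect agreement is only a warning sign. The decision to drop the column rests on when the value is recorded, not on the correlation itself.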
#### 2) Train-Test contamination
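- Leakage also occurs when you aren't careful to distinguish training data from validation data.
- Validation is meant to measure how the model does on data it hasn't seen before.
- This process can be corrupted in subtle ways if the validation data affects the preprocessing behavior.

- For example, if you fit a preprocessing step (such as an imputer for missing values) before the train-test split, the model may get good validation scores but perform poorly when deployed.
- This problem becomes even more dangerous with more complex feature engineering.
- If validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps.
- This is easy to do with scikit-learn pipelines. With cross-validation, it is even more critical to do the preprocessing inside the pipeline.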
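The pipeline advice above can be sketched as follows. This is an illustrative example on synthetic data, not the course's own code: the data, the choice of `LinearRegression`, and the mean-imputation strategy are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# The imputer lives inside the pipeline, so it is re-fit on the training
# part of every fold and never sees that fold's validation rows.
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```

Fitting the imputer on the full `X` before calling `cross_val_score` would let validation rows influence the imputed means, which is the contamination described above.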

Most people find target leakage very tricky until they've thought about it for a long time.

So, before trying to think about leakage in the housing price example, we'll go through a few examples in other applications. Things will feel more familiar once you come back to a question about house prices.

# Setup

The questions below will give you feedback on your answers. Run the following cell to set up the feedback system.

In [1]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex7 import *
print("Setup Complete")

Setup Complete


# Step 1: The Data Science of Shoelaces

- Nike: the goal is to save money on shoe materials
- Review a model built to predict how many shoelaces are needed each month

Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to review a model one of their employees built to predict how many shoelaces they'll need each month. The features going into the machine learning model include:
- The current month (January, February, etc)
- Advertising expenditures in the previous month
- Various macroeconomic features (like the unemployment rate) as of the beginning of the current month
- The amount of leather they ended up using in the current month
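    - The amount of leather used is a near-perfect indicator of how many shoes are produced, so it is data leakage if that amount is not yet known when the prediction is made
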
The results show the model is almost perfectly accurate if you include the feature about how much leather they used. But it is only moderately accurate if you leave that feature out. You realize this is because the amount of leather they use is a perfect indicator of how many shoes they produce, which in turn tells you how many shoelaces they need.

Do you think the _leather used_ feature constitutes a source of data leakage? If your answer is "it depends," what does it depend on?

After you have thought about your answer, check it against the solution below.

In [2]:
# Check your answer (Run this code cell to receive credit!)
q_1.check()

<span style="color:#33cc33">Correct:</span> 

This is tricky, and it depends on details of how data is collected (which is common when thinking about leakage). Would you at the beginning of the month decide how much leather will be used that month? If so, this is ok. But if that is determined during the month, you would not have access to it when you make the prediction. If you have a guess at the beginning of the month, and it is subsequently changed during the month, the actual amount used during the month cannot be used as a feature (because it causes leakage).
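One way to make the "it depends" reasoning concrete is to write down when each feature value becomes known and compare it with the prediction date. The sketch below does that; the feature names, the dates, and the `usable_features` helper are hypothetical and chosen only to mirror the shoelace example.

```python
from datetime import date

# Hypothetical record of when each feature value for March becomes known.
prediction_date = date(2024, 3, 1)  # the forecast is made at the start of the month
feature_known_on = {
    "current_month": date(2024, 3, 1),
    "ad_spend_previous_month": date(2024, 2, 29),
    "unemployment_rate_month_start": date(2024, 3, 1),
    "leather_used_current_month": date(2024, 3, 31),  # only known at month end
}


def usable_features(known_on, predict_on):
    """Keep only features whose value exists on or before the prediction date."""
    return [name for name, day in known_on.items() if day <= predict_on]


print(usable_features(feature_known_on, prediction_date))
```

If the leather amount were fixed at the beginning of the month, its date would move to March 1 and the feature would pass the check, matching the answer above.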

# Step 2: Return of the Shoelaces

You have a new idea. You could use the amount of leather Nike ordered (rather than the amount they actually used) leading up to a given month as a predictor in your shoelace model.

Does this change your answer about whether there is a leakage problem? If you answer "it depends," what does it depend on?

- It depends on whether the shoelaces or the leather are ordered first:
    1. Shoelaces ordered first => the amount of leather ordered is not known when predicting shoelace needs
    2. Leather ordered first => that figure is available when placing the shoelace order

In [3]:
# Check your answer (Run this code cell to receive credit!)
q_2.check()

<span style="color:#33cc33">Correct:</span> 

This could be fine, but it depends on whether they order shoelaces first or leather first. If they order shoelaces first, you won't know how much leather they've ordered when you predict their shoelace needs. If they order leather first, then you'll have that number available when you place your shoelace order, and you should be ok.

# Step 3: Getting Rich With Cryptocurrencies?

You saved Nike so much money that they gave you a bonus. Congratulations.

Your friend, who is also a data scientist, says he has built a model that will let you turn your bonus into millions of dollars. Specifically, his model predicts the price of a new cryptocurrency (like Bitcoin, but a newer one) one day ahead of the moment of prediction. His plan is to purchase the cryptocurrency whenever the model says the price of the currency (in dollars) is about to go up.
- Cryptocurrency price prediction (buy the currency whenever the model says its dollar price is about to go up)

The most important features in his model are:
- Current price of the currency
- Amount of the currency sold in the last 24 hours
- Change in the currency price in the last 24 hours
- Change in the currency price in the last 1 hour
- Number of new tweets in the last 24 hours that mention the currency

The value of the cryptocurrency in dollars has fluctuated up and down by over \$100 in the last year, and yet his model's average error is less than \$1. He says this is proof his model is accurate, and you should invest with him, buying the currency whenever the model says it is about to go up.

Is he right? If there is a problem with his model, what is it?

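# Answer

- These features are available at prediction time and are unlikely to be changed in the training data after the prediction target is determined, so there is no leakage.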

In [4]:
# Check your answer (Run this code cell to receive credit!)
q_3.check()

<span style="color:#33cc33">Correct:</span> 

There is no source of leakage here. These features should be available at the moment you want to make a prediction, and they're unlikely to be changed in the training data after the prediction target is determined. But the way he describes accuracy could be misleading if you aren't careful. If the price moves gradually, today's price will be an accurate predictor of tomorrow's price, but it may not tell you whether it's a good time to invest. For instance, if it is \$100 today, a model predicting a price of \$100 tomorrow may seem accurate, even if it can't tell you whether the price is going up or down from the current price. A better prediction target would be the change in price over the next day. If you can consistently predict whether the price is about to go up or down (and by how much), you may have a winning investment opportunity.
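To illustrate why a small average error proves little here, the sketch below scores a naive "tomorrow equals today" predictor on a made-up, slowly moving price series. The prices are invented for the example and the naive predictor is a stand-in, not the friend's actual model.

```python
# Made-up daily prices that drift gradually, in dollars.
prices = [100.0, 100.4, 99.9, 100.3, 100.8, 100.2, 100.6, 101.1]

today = prices[:-1]
tomorrow = prices[1:]

# Naive predictor: tomorrow's price is assumed to equal today's price.
naive_predictions = today

# Mean absolute error of the naive predictor on the price itself.
mae = sum(abs(p - a) for p, a in zip(naive_predictions, tomorrow)) / len(tomorrow)
print(f"MAE of naive predictor: ${mae:.2f}")

# Reframed target: the change in price over the next day.
predicted_changes = [p - t for p, t in zip(naive_predictions, today)]
actual_changes = [a - t for a, t in zip(tomorrow, today)]
print("predicted changes:", predicted_changes)
print("days the price actually rose:", sum(c > 0 for c in actual_changes))
```

The naive predictor has an error well under \$1, yet its predicted change is always zero, so it never says whether to buy. Evaluating on the change in price exposes that.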

# Step 4: Preventing Infections

An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.

- Goal: build a service that flags patients at risk of infection

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.

- Prediction target: whether the patient got an infection

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?

You have a clever idea. 
1. Take all surgeries by each surgeon and calculate the infection rate among those surgeons.
2. For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as a feature.

Does this pose any target leakage issues?
Does it pose any train-test contamination issues?

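# Result
1) Target leakage: there is leakage if a patient's outcome contributes to their own surgeon's infection rate. It can be avoided by computing the surgeon's infection rate from only the surgeries performed before the patient being predicted.

2) Train-test contamination: if the rate is computed from all surgeries a surgeon performed, including those in the test set, the model may not generalize when applied to new patients,
    <br> because the surgeon-risk feature then accounts for data in the test set.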

In [6]:
# Check your answer (Run this code cell to receive credit!)
q_4.check()

<span style="color:#33cc33">Correct:</span> 

This poses a risk of both target leakage and train-test contamination (though you may be able to avoid both if you are careful).

You have target leakage if a given patient's outcome contributes to the infection rate for his surgeon, which is then plugged back into the prediction model for whether that patient becomes infected. You can avoid target leakage if you calculate the surgeon's infection rate by using only the surgeries before the patient we are predicting for. Calculating this for each surgery in your training data may be a little tricky.

You also have a train-test contamination problem if you calculate this using all surgeries a surgeon performed, including those from the test-set. The result would be that your model could look very accurate on the test set, even if it wouldn't generalize well to new patients after the model is deployed. This would happen because the surgeon-risk feature accounts for data in the test set. Test sets exist to estimate how the model will do when seeing new data. So this contamination defeats the purpose of the test set.
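The "only the surgeries before the patient" idea can be sketched with pandas as below. The table, its column names, and the values are hypothetical. The sketch assumes each surgery has a date and that outcomes of earlier surgeries are already known when a later one is scored, which may not hold in practice.

```python
import pandas as pd

# Hypothetical surgery log; all values are made up.
surgeries = pd.DataFrame({
    "surgery_date": pd.to_datetime([
        "2023-01-05", "2023-01-12", "2023-02-01",
        "2023-02-10", "2023-03-03", "2023-03-15",
    ]),
    "surgeon": ["A", "B", "A", "A", "B", "B"],
    "infected": [1, 0, 0, 1, 0, 1],
})

# Order rows chronologically so "earlier rows" means "earlier surgeries".
surgeries = surgeries.sort_values("surgery_date").reset_index(drop=True)

# For each surgeon, shift() hides the current patient's outcome and
# expanding().mean() averages only the surgeries that came before it.
surgeries["surgeon_prior_infection_rate"] = (
    surgeries.groupby("surgeon")["infected"]
    .transform(lambda s: s.shift().expanding().mean())
)
print(surgeries)
```

A surgeon's first surgery has no history, so its rate is `NaN` and needs a fallback such as imputation. If the test set consists of the most recent surgeries, the same construction also keeps test outcomes out of the training features.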

# Step 5: Housing Prices (Building a Housing Price Prediction Model)

You will build a model to predict housing prices.  The model will be deployed on an ongoing basis, to predict the price of a new house when a description is added to a website.  Here are four features that could be used as predictors.
- The model is deployed to predict the price of a new house when its description is added to the website.

1. Size of the house (in square meters)
2. Average sales price of homes in the same neighborhood
3. Latitude and longitude of the house
4. Whether the house has a basement

You have historic data to train and validate the model.

Which of the features is most likely to be a source of leakage?

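1, 3, 4) House size, latitude and longitude, and whether there is a basement: these do not change after the house is sold.<br>

2) If this field is updated in the raw data after a home is sold, and that home's sale is used to calculate the average, it is target leakage.
    - In the extreme case where only one home was sold in the neighborhood and it is the home we are trying to predict, the average exactly equals the value we are trying to predict.
    - When the model is applied, the home being predicted has not been sold yet, so the feature will not work the way it did in the training data.
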
In [8]:
# Fill in the line below with one of 1, 2, 3 or 4.
potential_leakage_feature = 2

# Check your answer
q_5.check()

<span style="color:#33cc33">Correct:</span> 

2 is the source of target leakage. Here is an analysis for each feature: 

1. The size of a house is unlikely to be changed after it is sold (though technically it's possible). But typically this will be available when we need to make a prediction, and the data won't be modified after the home is sold. So it is pretty safe. 

2. We don't know the rules for when this is updated. If the field is updated in the raw data after a home was sold, and the home's sale is used to calculate the average, this constitutes a case of target leakage. At an extreme, if only one home is sold in the neighborhood, and it is the home we are trying to predict, then the average will be exactly equal to the value we are trying to predict.  In general, for neighborhoods with few sales, the model will perform very well on the training data.  But when you apply the model, the home you are predicting won't have been sold yet, so this feature won't work the same as it did in the training data. 

3. These don't change, and will be available at the time we want to make a prediction. So there's no risk of target leakage here. 

4. This also doesn't change, and it is available at the time we want to make a prediction. So there's no risk of target leakage here.
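The extreme case described for feature 2 can be reproduced in a few lines. The sales records and the `neighborhood_average` helper below are hypothetical. Excluding the home's own sale is one simple guard, and a stricter version would also restrict the average to sales dated before the listing.

```python
# Hypothetical sales records; prices are made up.
sales = [
    {"neighborhood": "Oak", "price": 300_000},
    {"neighborhood": "Oak", "price": 340_000},
    {"neighborhood": "Elm", "price": 500_000},  # the only sale in Elm
]


def neighborhood_average(rows, neighborhood, exclude_index=None):
    """Average sale price in a neighborhood, optionally leaving one row out."""
    prices = [
        row["price"]
        for i, row in enumerate(rows)
        if row["neighborhood"] == neighborhood and i != exclude_index
    ]
    return sum(prices) / len(prices) if prices else None


# Leaky version: the Elm home's own sale price feeds its own feature.
leaky_feature = neighborhood_average(sales, "Elm")

# Safer version: leave the home's own sale out of the average.
safe_feature = neighborhood_average(sales, "Elm", exclude_index=2)

print("leaky:", leaky_feature, "target:", sales[2]["price"])
print("safe:", safe_feature)
```

With the home's own sale included, the feature equals the target exactly. With it excluded there is no prior sale to average, which is the situation the deployed model would actually face.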

In [None]:
#q_5.hint()
#q_5.solution()

# Conclusion
Leakage is a hard and subtle issue. You should be proud if you picked up on the issues in these examples.

Now you have the tools to make highly accurate models, and pick up on the most difficult practical problems that arise with applying these models to solve real problems.

There is still a lot of room to build knowledge and experience. Try out a [Competition](https://www.kaggle.com/competitions) or look through our [Datasets](https://kaggle.com/datasets) to practice your new skills.

Again, Congratulations!

---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161289) to chat with other Learners.*