# Summary

To make real progress toward becoming an expert in data mining, it helps to apply everything you have learned to a project. This project requires you to explore real data, gain experience implementing a data mining project, and become industry-ready.

# Instruction
- Browse the internet for topics covered in the data mining field and choose ONE topic that best suits your group.
- Discuss with your colleagues the problem to solve, then select MESSY/NOISY data (you may extract several datasets) from established sources such as GitHub, Kaggle, or the UCI repository. The dataset must reflect or contribute to the real-world problem you will solve. Each group should solve a different problem.
- Use Python or any analytics tool to develop data mining models and solve the problem. The use of additional tools is recommended.
  

# Requirements
The project requires all of the following:
1. **Executive Summary**
   
    Description of the selected project, problem to be solved, and basic description of the data set from the sources.

2. **Summary of the Project Context and Objectives**
   
    Summarise your project context and list the objectives covered in your project.

3. **Methodology**

    Create a data pipeline to depict your sequence of actions that move data from a source to a destination. Show and explain all phases or steps involved in the data mining process, including the data preprocessing (ETL/ ELT), modelling, evaluation and deployment in your project development.

4. **Results and Discussion**

    Produce and explain all code, GUIs, graphs, diagrams, and any other visualisation output from the project development.

5. **Conclusion**

    Conclude your project and discuss some contributions to society, the environment, systems, etc.

6. **References**

    Cite every single reference used in completing the project. 


# Data Description

In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target isFraud. 
For each TransactionID in the test set, you must predict a probability for the isFraud variable.
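
As a minimal sketch of the expected output shape, the snippet below pairs each test TransactionID with a predicted probability and checks that every value lies in [0, 1]. The helper name `make_predictions_frame` and the sample IDs and probabilities are hypothetical, invented for illustration; only the column names TransactionID and isFraud come from the description above.

```python
import pandas as pd


def make_predictions_frame(transaction_ids, probabilities):
    """Pair each TransactionID with a predicted fraud probability.

    Hypothetical helper for illustration; it is not part of the dataset tooling.
    """
    ids = list(transaction_ids)
    probs = [float(p) for p in probabilities]
    if len(ids) != len(probs):
        raise ValueError("need exactly one probability per TransactionID")
    # A probability must lie in [0, 1]; NaN also fails this check.
    if any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")
    return pd.DataFrame({"TransactionID": ids, "isFraud": probs})


# Made-up IDs and model outputs, for illustration only.
predictions = make_predictions_frame([101, 102, 103], [0.02, 0.91, 0.10])
```

Any model that outputs class probabilities can feed this step; hard 0/1 labels would discard the ranking information that a probability carries.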

The data is split into two files, identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.
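
Because some transactions have no identity row, a left join from the transaction table keeps every transaction and leaves the identity columns empty where nothing matches. The sketch below shows this with pandas on two tiny made-up frames; the function name `join_tables` and the `has_identity` flag are assumptions introduced for illustration.

```python
import pandas as pd


def join_tables(transaction: pd.DataFrame, identity: pd.DataFrame) -> pd.DataFrame:
    """Left-join identity onto transaction by TransactionID."""
    merged = transaction.merge(
        identity, on="TransactionID", how="left", indicator=True
    )
    # The indicator column is "both" only when an identity row matched.
    merged["has_identity"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


# Tiny made-up frames standing in for the two files.
transaction = pd.DataFrame(
    {"TransactionID": [1, 2, 3], "ProductCD": ["W", "C", "W"], "isFraud": [0, 1, 0]}
)
identity = pd.DataFrame(
    {"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]}
)
merged = join_tables(transaction, identity)
```

An inner join would silently drop every transaction without identity information, which changes the class balance of the training data.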

1. [Further Information and related discussion](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203)
2. [Kaggler's insight](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203#610146)
3. [Labelling logic](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203#589276)

    It's a good question.

    Yes, they're all real data, no synthetic data. The logic of our labeling is define reported chargeback on the card as fraud transaction (isFraud=1) and transactions posterior to it with either user account, email address or billing address directly linked to these attributes as fraud too. If none of above is reported and found beyond 120 days, then we define as legit transaction (isFraud=0).
    However, in real world fraudulent activity might not be reported, e.g. cardholder was unaware, or forgot to report in time and beyond the claim period, etc. In such cases, supposed fraud might be labeled as legit, but we never could know of them. Thus, we think they're unusual cases and negligible portion.
4. [Main Ideas from Grandmaster's EDA](https://www.kaggle.com/code/cdeotte/eda-for-columns-v-and-id/notebook)

Transaction Table *
- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAmt: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issuing bank, country, etc.
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domains
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
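
Since TransactionDT is an offset rather than a timestamp, it can still yield relative time features. The sketch below ASSUMES the offset is measured in seconds, which the description above does not state, so verify that against the data before relying on it. The function and column names (`add_time_features`, `day_index`, `hour_offset`) are hypothetical.

```python
import pandas as pd

# ASSUMPTION: TransactionDT counts seconds from the unknown reference datetime.
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive relative day and hour offsets from TransactionDT."""
    out = df.copy()
    # Whole days elapsed since the reference point.
    out["day_index"] = out["TransactionDT"] // SECONDS_PER_DAY
    # Hour within that day, relative to the reference (not a true clock hour).
    out["hour_offset"] = (out["TransactionDT"] % SECONDS_PER_DAY) // SECONDS_PER_HOUR
    return out


# Made-up offsets, for illustration only.
sample = pd.DataFrame({"TransactionDT": [86_400, 90_000, 180_000]})
features = add_time_features(sample)
```

Because the reference datetime is unknown, these features are only meaningful relative to each other, for example for time-based train/validation splits.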

Categorical Features:
- ProductCD
- card1 - card6
- addr1, addr2
- P_emaildomain
- R_emaildomain
- M1 - M9

Identity Table *

Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked, and a pairwise dictionary will not be provided, for privacy protection and contract agreement.)

Categorical Features:
- DeviceType
- DeviceInfo
- id_12 - id_38
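
The categorical lists for both tables can be expanded into explicit column names and applied as a pandas dtype. This is a sketch under the assumption that the columns are named exactly as listed (for example card1 to card6 and id_12 to id_38); the helpers `expand` and `cast_categoricals` are hypothetical and skip any column missing from the frame.

```python
import pandas as pd


def expand(prefix: str, start: int, stop: int) -> list:
    """Build names such as card1..card6 or id_12..id_38 (inclusive range)."""
    return [f"{prefix}{i}" for i in range(start, stop + 1)]


TRANSACTION_CATEGORICAL = (
    ["ProductCD"]
    + expand("card", 1, 6)
    + ["addr1", "addr2", "P_emaildomain", "R_emaildomain"]
    + expand("M", 1, 9)
)
IDENTITY_CATEGORICAL = ["DeviceType", "DeviceInfo"] + expand("id_", 12, 38)


def cast_categoricals(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with the listed columns (where present) as category dtype."""
    present = {name: "category" for name in columns if name in df.columns}
    return df.astype(present)


# Made-up rows, for illustration only.
toy = pd.DataFrame({"ProductCD": ["W", "C"], "card1": [1001, 2002], "C1": [1.0, 3.0]})
typed = cast_categoricals(toy, TRANSACTION_CATEGORICAL)
```

Casting matters most for numeric-looking columns such as card1 or addr1, which a model would otherwise treat as ordered quantities.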