Below is a data mining and scripting exercise. We will use it to evaluate :
- problem-solving skills;
- machine learning skills; and
- programming skills.
To Do :
- Follow the instructions below while keeping your script clean and presentable. Ideally, make your script available on your own github.com account (the free version is sufficient : https://github.com/).
- Send us the link to your final commit before 9 am on the day of your interview.
- During the interview, we will ask that you present your work (preprocessing, model training, performance assessment, results & discussion).
We encourage you to present the results using either a notebook or a README file. At the very least, you should ensure that your results are presentable.
Remember :
- Make sure to apply best practices as you move through the exercises (data preprocessing, handling of missing values, hyperparameter search, model evaluation, result visualisation, etc.).
- Make assumptions where necessary; we are primarily interested in your approach.
- A good story is as important as an algorithm. We expect you to be able to communicate and present your ideas, methodology and implementations.
Good Luck!
The file fraud_prep.csv contains credit card transactions.
- Evaluate multiple classification algorithms to identify whether the transactions are fraudulent or not.
- Compare the performance of each model & identify the best performing one.
- Present how your model generalizes and performs on unseen data.
- Make sure to present all steps taken.
BONUS Points : Can you think of some unsupervised methods to accomplish this same task? If so, describe them (do not script them).
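When comparing classifiers on the fraud task, the choice of metric matters. The following is a minimal sketch, not a prescribed solution, and it rests on an assumption the brief does not state: that fraudulent transactions are rare. The labels, the predictions, and the names `always_legit` and `model_a` are made up purely for illustration; they do not come from fraud_prep.csv.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def summarize(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive (fraud) class."""
    c = confusion_counts(y_true, y_pred)
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    # Guard against division by zero when a model never predicts fraud.
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (c["tp"] + c["tn"]) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up held-out labels (assumption): 2 fraudulent transactions out of 20.
y_test = [0] * 18 + [1] * 2

# Made-up predictions from two hypothetical models on that held-out set.
predictions = {
    "always_legit": [0] * 20,            # never flags a transaction
    "model_a": [0] * 17 + [1] + [1, 0],  # 1 false alarm, catches 1 of 2 frauds
}

results = {name: summarize(y_test, y_pred)
           for name, y_pred in predictions.items()}
for name, metrics in results.items():
    print(name, metrics)
```

In this toy setup both models reach the same accuracy, yet only one of them detects any fraud. Under the rare-fraud assumption, precision, recall and F1 on the fraud class are therefore more informative for model comparison than accuracy alone.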
The Crime Dataset contains 128 socio-economic features from the US 1990 Census. The target is the crime rate per community.
Ref. : https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names
Using the crime_prep.csv file :
- Identify the variables that are most strongly correlated with the target.
- Apply either dimensionality reduction or feature selection to the dataset.
- Evaluate multiple regression algorithms to predict the crime rate.
- Compare the performance of each model & identify the best performing one.
- Present how your model generalizes and performs on unseen data.
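The first step above, ranking variables by their correlation with the target, could be sketched as follows. This is only one possible reading of "correlated": it uses the Pearson coefficient, which captures linear association only. The column names and values are hypothetical toy data, not taken from crime_prep.csv, and treating a constant column as having zero correlation is a choice made for this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; report 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(columns, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy features and target (illustration only).
toy_target = [0.1, 0.2, 0.3, 0.4, 0.5]
toy_columns = {
    "weak": [1.0, 3.0, 2.0, 3.0, 2.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "neg_noisy": [5.0, 3.0, 4.0, 2.0, 1.0],
    "pos_linear": [1.0, 2.0, 3.0, 4.0, 5.0],
}

ranking = rank_by_correlation(toy_columns, toy_target)
for name, r in ranking:
    print(f"{name}: {r:+.3f}")
```

Ranking by absolute value keeps strongly negative relationships near the top, which matters when the ranking is later used for feature selection. With pandas, a similar ranking is typically obtained with `DataFrame.corrwith` against the target column.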