import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

# Project Summary

Provide a clear explanation of:
  1. What this project is about
     - Clearly state the ML type (supervised/unsupervised)
  2. The goal of the project
     - E.g., why it is important and what you want to achieve or learn.

# Data Summary

Data Source:
1. Identify where the data came from (using the APA format)
2. Explain how the dataset was gathered (via API, csv, etc.)

Data Description:
1. Create a table to provide a description of each feature (at least some key features if too many)
2. Print out the results of df.info() to display:
   1. Number of samples/rows and the number of features/columns
   2. Data types of each feature (or just a summary if there are too many features, e.g., 10 categorical and 20 numeric features)
   3. Memory usage (if the file is large)
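
As a minimal sketch of the description step, the cell below builds a small synthetic DataFrame and prints its structure. The column names (`age`, `income`, `segment`) and the data are hypothetical placeholders, not part of any real project dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the project dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=100),
    "income": rng.normal(50_000, 12_000, size=100),
    "segment": rng.choice(["a", "b", "c"], size=100),
})

# Row count, column count, dtypes, non-null counts, and memory usage in one call.
df.info(memory_usage="deep")

# Compact dtype summary, useful when there are too many features to list.
dtype_counts = df.dtypes.astype(str).value_counts()
print(dtype_counts)

n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")
```

Passing `memory_usage="deep"` asks pandas to measure string columns accurately, which can be slow on very large files.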

Univariate Visualizations (provide clear explanations of what the purpose of each visualization is):
1. Missing values heatmap to identify:
   1. Imbalances between features
   2. Which features need to have missing data imputed or even have the feature dropped completely
2. Duplicate values heatmap to identify where rows are duplicated
3. Boxplots and/or histograms of numerical features to identify:
   1. How the distributions of each feature compare to each other
   2. Where outliers exist
4. For categorical features, create a bar chart of the frequency/count of occurrence for each category

Conclusions/Discussions/Next Steps:
1. Summarize the steps taken to describe the dataset
2. Identify any insights/findings made while describing the dataset
3. Give a brief description of what the next step will be in the analysis (data cleaning)

# Data Cleaning

Data cleaning (provide clear explanations of why each step is being applied to the dataset):
1. Convert data types
2. Create new columns that will help with the analysis such as:
   1. Adding datetime features (year, month, day, quarter, date, etc.)
   2. Pivoting columns
   3. Grouping rows by features
3. Renaming column headers
4. Filter/subset the dataset
5. Impute missing values, or drop the feature altogether if it is not important to the analysis
6. Apply methods to remove outliers
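
A minimal sketch of several of these cleaning steps on a tiny, hypothetical table is shown below. The column names, the median imputation, and the 1.5 × IQR outlier rule are illustrative assumptions; the right choices depend on the actual dataset.

```python
import pandas as pd

# Hypothetical raw data: everything arrives as strings, with gaps and one extreme price.
raw = pd.DataFrame({
    "Order Date": ["2023-01-15", "2023-04-02", "2023-07-19", "2023-10-30", "2023-12-05"],
    "Unit Price": ["10.5", "11.0", None, "9.5", "500.0"],
    "Region": ["north", "south", "north", None, "south"],
})

# Rename column headers to snake_case.
clean = raw.rename(columns=lambda c: c.lower().replace(" ", "_"))

# Convert data types.
clean["order_date"] = pd.to_datetime(clean["order_date"])
clean["unit_price"] = pd.to_numeric(clean["unit_price"], errors="coerce")
clean["region"] = clean["region"].astype("category")

# Add datetime features.
clean["year"] = clean["order_date"].dt.year
clean["quarter"] = clean["order_date"].dt.quarter

# Impute the numeric gap with the median; drop rows missing the category.
clean["unit_price"] = clean["unit_price"].fillna(clean["unit_price"].median())
clean = clean.dropna(subset=["region"])

# Remove outliers that fall outside 1.5 * IQR of the quartiles.
q1, q3 = clean["unit_price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["unit_price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].reset_index(drop=True)

print(clean)
```

Each step here changes the row count or the distributions, so it is worth recording the shape before and after every step to support the written explanations.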

Conclusions/Discussions/Next Steps:
1. Summarize the steps taken to clean the dataset
2. Identify any insights/findings made while cleaning (including any foreseen difficulties that could occur during analysis)
3. Give a brief description of what the next step will be in the analysis (EDA)

# EDA

Multivariate Visualizations (Using colors, size, or faceted by categories where applicable. Also provide clear explanations of what the purpose of each visualization is):
   1. Correlation matrix
   2. Pairwise bivariate plots (e.g., sns.pairplot())
   3. Scatter plots
   4. Line charts

Conclusions/Discussions/Next Steps:
1. Summarize the steps taken to explore the dataset
2. Identify any insights/findings made during EDA (including any foreseen difficulties that could occur during analysis)
3. Give a brief description of what the next step will be in the analysis (Modeling)

# Modeling

- Use multiple (appropriate) ML models
  - use models not covered in class
  - Is the choice of model(s) appropriate for the problem?
- Interaction/collinearity between features
  - Is there interaction/collinearity between features that can be a problem for the choice of the model?
  - Does the author properly handle any interaction or collinearity (e.g., in linear regression)? Or does the author confirm that no such effect exists for the chosen model?
- Feature importance
  - Investigate which features are important by looking at feature rankings or importance from the model
- Hyperparameter tuning
- Managing data imbalance
  - Use regularization or other training techniques such as cross-validation, or oversampling/undersampling/SMOTE or similar, to manage data imbalance
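
One possible way to combine hyperparameter tuning, imbalance handling, and feature importance is sketched below with scikit-learn. The dataset is synthetic, the feature names (`f0`, `f1`, ...) are hypothetical, and the random forest with `class_weight="balanced"` is only one of several reasonable choices, not a required model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical imbalanced dataset (about 85% / 15%) with some redundant features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=2,
    weights=[0.85, 0.15], random_state=0,
)
feature_names = [f"f{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Cross-validated grid search; class_weight="balanced" reweights the minority class.
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

# Impurity-based feature importance from the best model, ranked high to low.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can be misleading when features are strongly correlated, so they are best read alongside the collinearity checks from the EDA.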

# Results and Analysis

- A summary of results and analysis which includes:
  1. Proper visualizations (e.g., tables, graphs/plots, heat maps, statistics summary with interpretation, etc.)
  2. Use various evaluation metrics (e.g., if your data is imbalanced, metrics such as F1 or ROC AUC are more informative than accuracy alone).
     1. Explain why each metric was chosen.
  3. Iterate the training and evaluation process to improve performance
     1. Address feature selection during the iteration process
  4. Compare the results from the multiple models and make appropriate comparisons
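
A comparison table across models and metrics could be assembled along these lines. This is a sketch on synthetic imbalanced data; the two models and the three metrics are illustrative assumptions and should be replaced with those suited to the actual problem.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset (about 90% / 10%).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)               # hard labels for accuracy and F1
    proba = model.predict_proba(X_test)[:, 1]  # positive-class scores for ROC AUC
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, proba),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(3))
```

With a 90/10 class split, a model that always predicts the majority class already reaches about 90% accuracy, which is why F1 and ROC AUC are reported next to it.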

# Discussion & Conclusion

1. Learning and takeaways
2. Why something didn’t work
3. Suggest ways to improve