### CSCI 4622 Final Project Write-Up

Myeongseon Lee, Vinayak Sharma, Jaskrit Singh, Joshua Sun

### 1. Project topic

* Is there a clear explanation of what this project is about? Does it state clearly which type of problem (e.g. type of learning and type of task)?
* Does it state the motivation or the goal (or why it’s important, what goal the team wants to achieve, or want to learn) clearly?
* (Extra credit) Is the project topic creative? Requires collecting data (e.g. scraping)?

Motivation: Our project is motivated by a drought-prediction dataset found on Kaggle. The dataset consists of weather data, and the overall goal is to predict drought. The drought score is recorded as a continuous value in our data, but it can also be treated as a categorical variable (a drought level), so we treat it as both in this project. Our goal is to apply what we have learned in this class to model this data.

This is a supervised learning project, framed both as classification (on the drought level) and as regression (on the continuous score).
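
Because the same score column serves as both targets, one way to derive them is to keep the raw score for regression and floor it for classification. The sketch below is a minimal illustration under stated assumptions: the score values are toy data made up for this example, and the `drought_level` name follows the commented-out line in the preprocessing cell rather than a finalized design.

```python
import numpy as np
import pandas as pd

# Toy scores for illustration only; the real values come from the 'score' column.
scores = pd.Series([0.0, 0.4, 1.7, 2.0, 3.9], name="score")

# Regression target: the continuous score, unchanged.
y_regression = scores

# Classification target: floor each score to an integer drought level.
y_classification = np.floor(scores).astype(int).rename("drought_level")

print(pd.concat([y_regression, y_classification], axis=1))
```

Flooring keeps the class boundaries at whole numbers, so a score of 1.7 and a score of 1.0 fall into the same level.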

### 2. Data

* Is the data source properly quoted and described? (including links, brief explanations)
* Do they explain the data description properly? The data description can include the data size
    * e.g. for tabulated data: number of samples/rows, number of features/columns, byte size if a huge file, data type of each feature (or just a summary if too many features e.g. 10 categorical, 20 numeric features), description of features (at least some key features if too many), whether the data is a multi-table form or gathered from multiple data source.
    * e.g. for images: you can include how many samples, number of channels (color or gray or more?) or modalities, image file format, whether images have the same dimension or not, etc.
    * e.g. sequential data: texts, sound file; please describe appropriate properties such as how many documents or words, how many sound files with typical length (are they the same or variable), etc.
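
For tabular data, the size and type facts listed above can be gathered programmatically. The following is a rough sketch, not the project's actual data description: `describe_table` is a hypothetical helper, and the small DataFrame is a made-up stand-in that only reuses the `fips`, `date`, and `score` column names that appear in the preprocessing code.

```python
import pandas as pd

def describe_table(df: pd.DataFrame) -> dict:
    """Collect the size and type facts a tabular data description usually reports."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "bytes": int(df.memory_usage(deep=True).sum()),
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
    }

# Hypothetical stand-in; in the notebook this would be a DataFrame read from a CSV.
toy = pd.DataFrame({
    "fips": [1001, 1001, 1003],
    "date": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-01"]),
    "score": [1.0, None, 2.0],
})

summary = describe_table(toy)
print(summary)
```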

### 3. Data cleaning

Example score breakdown for tabulated data format: no cleaning 0 pts (if the data was given perfectly cleaned, just give a default score of 5 pts), data types munging +1, drop NA +1, impute +1, identify imbalance +1, and identify data-specific potential problem +1.
* Does it include clear explanations on how and why cleaning is performed?
    * (e.g.) the author decided to drop a feature because it had too many NaN values and the data cannot be imputed.
    * (e.g.) the author decided to impute certain values in a feature because the number of missing values was small and he/she was able to find similar samples OR, he/she used an average value or interpolated value, etc.
    * (e.g.) the author removed some features because there are too many of them and they are not relevant to the problem, or he/she knows only a few certain features are important based on their domain knowledge judgment.
    * (e.g.) the author removed a certain sample (row) or a value because it is an outlier.
    * (e.g.) if the project is on text data, is stopword filtering conducted? If no, why not?
* Does it have conclusions or discussions? E.g. the data cleaning summary, findings, discussing foreseen difficulties, and/or analysis strategy.
* Does it have a proper visualization?

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date, datetime
%matplotlib inline

In [10]:
# directory containing the dataset CSV files; change to match your local path
path = '../archive/'

In [11]:
train = pd.read_csv(path+'train_timeseries.csv')
valid = pd.read_csv(path+'validation_timeseries.csv')
test = pd.read_csv(path+'test_timeseries.csv')

In the following code, we parse the dates, fill missing drought scores by linear interpolation, and then drop any rows that still contain NaN values.

In [12]:
# training data preprocessing
train['date'] = pd.to_datetime(train['date'])  # parse date strings into datetimes
train['score'] = pd.to_numeric(train['score']).interpolate()  # fill missing scores linearly
#train['drought_level'] = np.floor(train['score'])  # classify drought level
#train.drop(columns=['fips', 'score'], inplace=True)  # remove unnecessary columns
train.dropna(inplace=True)  # remove all rows that still contain NaN

# validation data preprocessing
valid['date'] = pd.to_datetime(valid['date'])  # parse date strings into datetimes
valid['score'] = pd.to_numeric(valid['score']).interpolate()  # fill missing scores linearly
#valid['drought_level'] = np.floor(valid['score'])  # classify drought level
#valid.drop(columns=['fips', 'score'], inplace=True)  # remove unnecessary columns
valid.dropna(inplace=True)  # remove all rows that still contain NaN

# testing data preprocessing
test['date'] = pd.to_datetime(test['date'])  # parse date strings into datetimes
test['score'] = pd.to_numeric(test['score']).interpolate()  # fill missing scores linearly
#test['drought_level'] = np.floor(test['score'])  # classify drought level
#test.drop(columns=['fips', 'score'], inplace=True)  # remove unnecessary columns
test.dropna(inplace=True)  # remove all rows that still contain NaN
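
Calling `interpolate()` on the whole column fills gaps in row order. If rows for several locations are stacked in one table, a gap at the start of one location can be filled using the last value of the previous one. The sketch below compares this with a per-group variant. It is an illustration only: it assumes `fips` identifies a location and that rows are sorted by date within each `fips`, and the values are toy data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "fips":  [1001, 1001, 1001, 1003, 1003, 1003],
    "score": [1.0, np.nan, 3.0, np.nan, 0.0, 2.0],
})

# Whole-column interpolation: gaps are filled in row order, so the first
# row of fips 1003 is filled from the last value of fips 1001.
global_fill = toy["score"].interpolate()

# Per-group interpolation: each fips is filled using only its own values,
# so a leading gap in a group stays NaN.
grouped_fill = toy.groupby("fips")["score"].transform(lambda s: s.interpolate())

print(pd.DataFrame({"fips": toy["fips"], "global": global_fill, "grouped": grouped_fill}))
```

With the per-group variant, the leading gap for `fips` 1003 remains NaN and would then be removed by `dropna`, instead of receiving a value blended from a different location.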

### 4. Exploratory Data Analysis

Example score breakdown for tabulated data format: no cleaning 0 pts (if the data was given perfectly cleaned, just give a default score of 5 pts), data types munging +1, drop NA +1, impute +1, identify imbalance +1, and identify data-specific potential problem +1.

* Does it include clear explanations on how and why cleaning is performed?
    * (e.g.) the author decided to drop a feature because it had too many NaN values and the data cannot be imputed.
    * (e.g.) the author decided to impute certain values in a feature because the number of missing values was small and he/she was able to find similar samples OR, he/she used an average value or interpolated value, etc.
    * (e.g.) the author removed some features because there are too many of them and they are not relevant to the problem, or he/she knows only a few certain features are important based on their domain knowledge judgment.
    * (e.g.) the author removed a certain sample (row) or a value because it is an outlier.
    * (e.g.) if the project is on text data, is stopword filtering conducted? If no, why not?
* Does it have conclusions or discussions? E.g. the data cleaning summary, findings, discussing foreseen difficulties, and/or analysis strategy.
* Does it have a proper visualization?

### 5. Models

Example score breakdown for typical supervised learning: 8 if a proper single model is used, +2 if addresses multilinear regression/collinearity for regression models, +2 feature engineering, +2 multiple ML models, +2 hyperparam tuning, +2 regularization, or other training techniques such as cross-validation, oversampling/undersampling or similar for managing data imbalance, +2 using models not covered from class.
* Is the choice of model(s) appropriate with the problem?
* Is the author aware of whether interaction/collinearity between features can be a problem for the choice of the model and properly treat if there is interaction or collinearity (e.g. linear regression)? Or confirms that there is no such effect with the choice of the model?
* Did the author use multiple (appropriate) models?
* Did the author investigate which ones are important features by looking at feature rankings or importance from the model? (Not by a judgment which we already covered in the EDA category) Did the author use techniques to reduce overfitting or data imbalance?
* Did the author use new techniques/models we didn't cover in the class?
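
One common way to check for collinearity before fitting a linear model is to inspect pairwise correlations together with variance inflation factors (VIF). The sketch below uses synthetic features; the column names and data are made up for illustration and are not the project's features. It relies on the fact that each feature's VIF equals the matching diagonal entry of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
temp = rng.normal(size=n)

# 'temp_max' is built to be nearly a copy of 'temp', so the two are collinear.
features = pd.DataFrame({
    "temp": temp,
    "temp_max": temp + 0.05 * rng.normal(size=n),
    "wind": rng.normal(size=n),
})

# Pairwise Pearson correlations between features.
corr = features.corr()

# VIF per feature: diagonal of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.to_numpy())), index=features.columns)

print(corr.round(2))
print(vif.round(1))
```

A large VIF for `temp` and `temp_max` alongside a VIF near 1 for `wind` would suggest dropping or combining one of the first two before using a linear regression.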

### 6. Results and Analysis


Example score breakdown: showing basic result 12, with a good amount of visualization +2, try different evaluation metrics +2, iterate training/evaluating and improve performance +2, show/discuss model comparison +2
* Does it have a summary of results and analysis?
* Does it have a proper visualization? (e.g. tables, graphs/plots, heat maps, statistics summary with interpretation, etc)
* Does it use different kinds of evaluation metrics properly? (e.g. if your data is imbalanced, there are other metrics F1, ROC, or AUC better than mere accuracy). Also, does it explain why they chose the metric?
* Does it iterate the training and evaluation process and improve the performance? Does it address selecting features through the iteration process?
* Did the author compare the results from the multiple models and make an appropriate comparison?
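
To illustrate why accuracy alone can mislead on imbalanced data, the sketch below computes accuracy and macro-averaged F1 by hand for a classifier that always predicts the majority class. The labels are toy data, and `accuracy` and `macro_f1` are small helper functions written for this example, not functions from a library.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy imbalanced labels: 9 samples of class 0, 1 sample of class 1.
y_true = [0] * 9 + [1]
# A classifier that always predicts the majority class.
y_pred = [0] * 10

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Here accuracy is 0.9 while macro F1 is about 0.47, because the minority class is never predicted and contributes an F1 of 0.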


### 7. Discussion and Conclusion

Example score breakdown: basic reiteration of the result 6, discussion on what are the learnings and takeaways, discussion on why something didn't work +2, suggesting ways to improve +2.