### My Data Science Process

**01** [Problem Statement](#01-Problem-Statement)
- [ ] Is it clear what the student plans to do?
- [ ] What type of model will be developed?
- [ ] How will success be evaluated?
- [ ] Is the scope of the project appropriate?
- [ ] Is it clear who cares about this or why this is important to investigate?
- [ ] Does the student consider the audience and the primary and secondary stakeholders?  

**02** [Data Collection and Cleaning](#02-Data-Collection-and-Cleaning)  
- [x] Are missing values imputed appropriately?
- [ ] Are distributions examined and described?
- [ ] Are outliers identified and addressed?
- [ ] Are appropriate summary statistics provided?
- [x] Are steps taken during data cleaning and EDA framed appropriately?
- [ ] Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?  

**03** [EDA & Pre-Processing / Feature Engineering](#03-EDA-and-Pre-Processing-/-Feature-Engineering)  
- [x] Are categorical variables one-hot encoded?
- [x] Does the student investigate or manufacture features with linear relationships to the target?
- [x] Have the data been scaled appropriately?
- [x] Does the student properly split and/or sample the data for validation/training purposes?
- [ ] Does the student utilize feature selection to remove noisy or multi-collinear features?
- [ ] Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** linear regression, lasso, and ridge)?
- [x] Does the student defend their choice of production model relevant to the data at hand and the problem?
- [ ] Does the student explain how the model works and evaluate its performance successes/downfalls?

**04** [Modeling / Feature Selection](#04-Modeling-/-Feature-Selection)
- Refer to benchmarks from section 3

**05** [Evaluation and Conceptual Understanding](#05-Evaluation-and-Conceptual-Understanding)
- [x] Does the student accurately identify and explain the baseline score?
- [x] Does the student select and use metrics relevant to the problem objective?
- [ ] Is more than one metric utilized in order to better assess performance?
- [ ] Does the student interpret the results of their model for purposes of inference?
- [ ] Is domain knowledge demonstrated when interpreting results?
- [ ] Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?

**06** [Conclusion and Recommendations](#06-Conclusion-and-Recommendations)
- [ ] Does the student provide appropriate context to connect individual steps back to the overall project?
- [ ] Is it clear how the final recommendations were reached?
- [ ] Are the conclusions/recommendations clearly stated?
- [ ] Does the conclusion answer the original problem statement?
- [ ] Does the student address how findings of this research can be applied for the benefit of stakeholders?
- [ ] Are future steps to move the project forward identified?

### Organization and Professionalism

**Project Organization**
- [x] Are modules imported correctly (using appropriate aliases)?
- [x] Are data imported/saved using relative paths?
- [ ] Does the README provide a good executive summary of the project?
- [x] Is markdown formatting used appropriately to structure notebooks?
- [x] Is there an appropriate amount of comments to support the code?
- [x] Are files & directories organized correctly?
- [ ] Are there unnecessary files included?
- [x] Do files and directories have well-structured, appropriate, consistent names?

**Visualizations**
- [ ] Are sufficient visualizations provided?
- [ ] Do plots accurately demonstrate valid relationships?
- [ ] Are plots labeled properly?
- [ ] Are plots interpreted appropriately?
- [ ] Are plots formatted and scaled appropriately for inclusion in a notebook-based technical report?

**Python Syntax and Control Flow**
- [ ] Is care taken to write human readable code?
- [ ] Is the code syntactically correct (no runtime errors)?
- [ ] Does the code generate desired results (logically correct)?
- [ ] Does the code follow general best practices and style guidelines?
- [ ] Are Pandas functions used appropriately?
- [ ] Are `sklearn` methods used appropriately?

**Presentation**
- [ ] Is the problem statement clearly presented?
- [ ] Does a strong narrative run through the presentation building toward a final conclusion?
- [ ] Are the conclusions/recommendations clearly stated?
- [ ] Is the level of technicality appropriate for the intended audience?
- [ ] Is the student substantially over or under time?
- [ ] Does the student appropriately pace their presentation?
- [ ] Does the student deliver their message with clarity and volume?
- [ ] Are appropriate visualizations generated for the intended audience?
- [ ] Are visualizations necessary and useful for supporting conclusions/explaining findings?

| Task | Description | % Complete |
| :--- | :----: | ---:|
| Problem Statement | | |
| Goal | | |
| Data Collection | | |

# **01** Problem Statement

### Problem & Goal
###### **PREDICTING FUTURE PLANE PRICES BY MARKET/ROUTE & MONTH (USA)**
- We often limit our travel destinations to places we already know. For instance, we pick a vacation destination we are familiar with or a place our friends have visited, and then look into ways of executing that specific plan. We end up spending a significant amount of time fitting these plans to our budget.

- But there are many other options. Wouldn’t it be amazing to predict the best-value flights for your specific budget and time window? Our product aims to bring transparency to potential vacation destinations and to surface destinations you may not have previously considered, while adhering to your budget and time constraints.

###### **METHODS / MODELS**
- Linear Time Series Modeling
- Seasonal / ARIMA Modeling
- Multivariate Time Series Modeling (VAR)

###### **EVALUATION**
- Predicting Future Airline Prices by Market / Month
- Benchmark R2 - 20% increase over baseline model
- Benchmark MSE
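
As a rough illustration of the two benchmarks above, the sketch below computes MSE and R2 for a model and for a naive baseline. The toy numbers and the choice of "always predict the mean" as the baseline are assumptions made for this example, not the project's actual baseline.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy fares (assumed values, for illustration only).
y_true = [1.0, 2.0, 3.0, 4.0]
model_pred = [1.0, 2.0, 3.0, 5.0]
baseline_pred = [2.5] * 4  # assumed baseline: always predict the mean

print("model    MSE:", mse(y_true, model_pred), "R2:", r2(y_true, model_pred))
print("baseline MSE:", mse(y_true, baseline_pred), "R2:", r2(y_true, baseline_pred))
```

A mean-only baseline scores an R2 of 0 by construction, so any improvement target has to be stated relative to whichever baseline model is actually chosen.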


# **02** Data Collection and Cleaning

### Data Collection
###### **Sources**
- [**Historical Jet Fuel Prices**](https://www.eia.gov/opendata/qb.php?sdid=PET.EER_EPJK_PF4_RGC_DPG.M) 
    - Data showcases the price of Jet Fuel in US Dollars.
    - Data separated by month.
    - Data collected ranges from April 1990 to August 2020.
- [**Top 1,000 Contiguous State City-Pair Markets**](https://data.transportation.gov/Aviation/Consumer-Airfare-Report-Table-1-Top-1-000-Contiguo/4f3n-jbg2)
    - Data showcases the average airfare per route, separated by origin and destination city, for the 48 contiguous U.S. states.
    - Data separated by quarter.
    - Data collected ranges from Q1 1996 to Q3 2019.
- [**US Domestic Flights**](https://academictorrents.com/details/a2ccf94bbb4af222bf8e69dad60a68a29f310d9a)
    - Data showcases the airline flight data including route by city, route by airport, passengers, number of flights, total seats available, distance, and population.
    - Data separated by month.
    - Data collected ranges from January 1990 to December 2009.

### Data Cleaning / Merging
- Clean 
    - Historical Jet Fuel Prices
        - Saved as variable 'fuel'
        - DatetimeIndex created
    - Top 1,000 Contiguous State City-Pair Markets
        - Saved as variable 'airfare'
        - DatetimeIndex created
        - Identify matching routes
        - Changed city names to match city names from different dataset
    - US Domestic Flights
        - Saved as variable 'flights'
        - DatetimeIndex created  
- Merge
    - Combine US Domestic Flights (left) & Top 1,000 Contiguous State City-Pair Markets.
    - Left join on route, quarter, and year to preserve shape of US Domestic Flights.
        - **Imputation:** airfare route pricing data was gathered on a quarterly basis, so the same quarterly value was imputed for each month of the corresponding quarter.
    - Resulting dataframe contains 381 different routes over 168 months.
        - Dataset is by month and ranges from the beginning of 1996 to end of 2009.
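
The merge and imputation steps above can be sketched as follows. This is a minimal example on toy data. The column names (`route`, `date`, `passengers`, `fare`) and the values are assumptions for illustration and may not match the real datasets.

```python
import pandas as pd

# Hypothetical monthly flight records for a single route.
flights = pd.DataFrame({
    "route": ["A-B"] * 6,
    "date": pd.date_range("1996-01-01", periods=6, freq="MS"),
    "passengers": [100, 110, 120, 130, 140, 150],
})

# Hypothetical quarterly average fares for the same route.
airfare = pd.DataFrame({
    "route": ["A-B", "A-B"],
    "year": [1996, 1996],
    "quarter": [1, 2],
    "fare": [200.0, 220.0],
})

# Derive the join keys from the monthly dates.
flights["year"] = flights["date"].dt.year
flights["quarter"] = flights["date"].dt.quarter

# Left join keeps every monthly row; each month inherits its quarter's fare.
merged = flights.merge(airfare, on=["route", "year", "quarter"], how="left")
merged = merged.set_index("date")  # DatetimeIndex for time series work

print(merged[["route", "passengers", "fare"]])
```

Because the join is a left join on `flights`, any month whose route and quarter have no fare record would come through with a missing `fare` rather than being dropped.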



# **03** EDA and Pre-Processing / Feature Engineering

### Datasets
- We begin with three datasets:
    - df_combined: all of our data (1996-2009)
    - df_train: the training portion of df_combined (1996-2006)
    - df_test: unseen data used to test how well our model actually performs (2007-2009)
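
A chronological split like the one described above might look like this sketch. The single `fare` column and its values are placeholders, and only the date ranges come from the text.

```python
import numpy as np
import pandas as pd

# Placeholder monthly data covering January 1996 through December 2009.
idx = pd.date_range("1996-01-01", "2009-12-01", freq="MS")
df_combined = pd.DataFrame({"fare": np.arange(len(idx), dtype=float)}, index=idx)

# Split by date rather than randomly, so the test set is strictly in the future.
df_train = df_combined.loc[:"2006-12-31"]
df_test = df_combined.loc["2007-01-01":]

print(len(df_combined), len(df_train), len(df_test))
```

Splitting on time instead of shuffling keeps information from 2007-2009 out of the training data, which matters for any forecasting model.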
    
### Visualize Data
- Determine the top route by passenger volume over the life of our data and use it for the baseline model
- Visualize route data: once this route is selected, we dive into EDA and visualize its data
    - Lineplots
    - Decomposition Plots

### Pre-Processing / Feature Engineering
- Feature Engineering
    - General Features
- ACF & PACF Plots    
    - Trend / Seasonal Features
- Pre-Processing
    - Augmented Dickey-Fuller (ADF) test on each data series in our dataframes
    - Engineer features with stationary data

# **04** Modeling / Feature Selection
### OLS
- Build an OLS (Ordinary Least Squares) linear regression model with optimal features on one route (1996-2006)
    - Analyze the statsmodels summary and ensure each feature's p-value falls below a 10% threshold
    - Iteration: Remove features & retest new model
- Apply our best model on all routes (1996-2006)
- Apply our best model on unseen data for each route (2007-2009)


### ARIMA


### VAR

# **05** Evaluation and Conceptual Understanding

# **06** Conclusion and Recommendations