### Problem Statement

This project examined how Big Mountain Resort (BMR) can capitalize more on its facilities as it expands, and grow its business by implementing a new pricing model for its ski resort tickets. The aim was to build a predictive model for ticket prices based on the features and excursions present at a ski resort. The data for building this model covered 330 US resorts that can be considered part of BMR's market segment, supplied by BMR's Database Manager, along with geographic and population data scraped from Wikipedia.

### Data Wrangling

The `AdultWeekday` feature was chosen as the target column ahead of `AdultWeekend` because it contained more entries. Even so, ticket prices were missing for 16% of the resorts. Because a record without a target value cannot be used to train or evaluate the model, those records were dropped.

More than 50% of the records had null values for the `fastEight` feature, which represents the number of fast eight-person chairs, so that feature was dropped.
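
The two drops described above can be sketched with `pandas` as follows. This is a minimal illustration, assuming a DataFrame named `ski_data` (a hypothetical name); the rows and values are invented toy data, not the project's dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the resort table; all values are invented for illustration.
ski_data = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "AdultWeekday": [50.0, np.nan, 65.0, 70.0, 45.0],
    "AdultWeekend": [55.0, np.nan, 70.0, np.nan, 49.0],
    "fastEight": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

target = "AdultWeekday"

# Fraction of missing values per column, used to spot sparse features.
missing_share = ski_data.isnull().mean()

# Drop rows where the target price is unknown.
ski_data = ski_data.dropna(subset=[target])

# Drop the feature that is null for more than half of the records.
ski_data = ski_data.drop(columns=["fastEight"])
```

Checking `missing_share` before dropping makes the 50% threshold explicit rather than relying on a visual scan of the data.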

A few outliers were spotted among the numerical features:
- `SkiableTerrain_ac`, which represents the total skiable area in acres, had a value > 26,000 for one of the records. This outlier was fact-checked on that company's website and a "*data correction*" was made. The true value was exactly 25,000 less.
- `yearsOpen`, which represents the total number of years that the resort has been open, had a value of 2019 for one record. This was most likely an input error in which the opening year was entered instead. The row was dropped rather than corrected, because assuming a 2019 opening would have made it the youngest resort in the entire dataset.
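
A rough sketch of these two fixes is shown below. The resort names, the acreage figure and the `1000` plausibility cut-off are assumptions made for illustration; only the column names and the 25,000 correction come from the text.

```python
import pandas as pd

# Invented rows that mimic the two outliers described above.
ski_data = pd.DataFrame({
    "Name": ["Resort A", "Resort B", "Resort C"],
    "SkiableTerrain_ac": [26819.0, 1200.0, 640.0],
    "yearsOpen": [60.0, 2019.0, 45.0],
})

# Correct the acreage outlier: the verified value was 25,000 less.
acreage_outlier = ski_data["SkiableTerrain_ac"] > 26000
ski_data.loc[acreage_outlier, "SkiableTerrain_ac"] -= 25000

# Drop rows whose yearsOpen is implausibly large (assumed cut-off of 1000).
ski_data = ski_data[ski_data["yearsOpen"] < 1000].reset_index(drop=True)
```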

State-wide summary statistics were derived for the market segment to investigate whether supply and demand played a role that should be factored into the pricing strategy.
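
One way such state-wide summaries could be derived is with a `pandas` group-by, sketched below. The `daysOpenLastYear` column, the output column names and all values are hypothetical; the write-up does not list which statistics were computed.

```python
import pandas as pd

# Invented resort-level rows; daysOpenLastYear is a hypothetical column.
ski_data = pd.DataFrame({
    "state": ["Montana", "Montana", "Utah", "Utah", "Utah"],
    "Name": ["A", "B", "C", "D", "E"],
    "SkiableTerrain_ac": [3000.0, 500.0, 2000.0, 1500.0, 800.0],
    "daysOpenLastYear": [120.0, 100.0, 140.0, 130.0, 110.0],
})

# Collapse resort-level rows to one row of totals per state.
state_summary = (
    ski_data.groupby("state")
    .agg(
        resorts_per_state=("Name", "size"),
        state_total_skiable_area_ac=("SkiableTerrain_ac", "sum"),
        state_total_days_open=("daysOpenLastYear", "sum"),
    )
    .reset_index()
)
```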


### EDA

We looked at the state-wide statistics first. Some of the numerical features were grouped by `state` and collapsed to one descriptive statistic per `state`, so that a principal component analysis (`PCA`) could reduce these features to 2 components that can be represented on a 2D plot.

For the PCA, the data was first scaled so that each feature was standardized to have a mean of 0 and unit standard deviation. The cumulative explained variance ratio showed that the first 2 PCA components accounted for more than 75% of the variance in this subset of the data.
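
The scaling and PCA steps might look roughly like the sketch below. The input matrix is synthetic random data, so its variance ratios will not match the figure quoted above; the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)

# Synthetic stand-in for the state summary: 30 states x 5 features.
X = rng.normal(size=(30, 5))
# Make the second feature strongly correlated with the first.
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=30)

# Standardize each feature to mean 0 and unit standard deviation.
X_scaled = scale(X)

# Fit PCA on all components, then inspect the cumulative variance ratio.
pca = PCA().fit(X_scaled)
cumulative_variance = pca.explained_variance_ratio_.cumsum()

# Keep the first two components for a 2D plot.
components_2d = pca.transform(X_scaled)[:, :2]

# Rows of components_ hold each feature's loading on a component.
first_two_loadings = pca.components_[:2]
```

Reading `cumulative_variance[1]` gives the share of variance captured by the first 2 components.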

An annotated scatter plot of these 2 PCA components, with a further dimension added by coloring each point (the plot's `hue`) by the binned median `AdultWeekend` ticket price per state, did not show any obvious pattern linking `AdultWeekend` ticket price to `state`. However, we were able to learn which features could warrant more attention based on each feature's contribution to the 1st and 2nd rows of the `components_` attribute of the fitted `PCA` object. From these initial explorations, we considered it justifiable to leave the `state` feature out of the model and to treat all states the same.

We also looked at the resort-level data and combined it with some of the state-level features to create a few new features (feature engineering). We then looked at the correlation between all of the numerical features, focusing mainly on those that were highly correlated with the `AdultWeekend` ticket price. The numerical features which stood out the most were `Runs`, `fastQuads`, `vertical_drop`, `Snow Making_ac` and `total_chairs`, each with a correlation coefficient of 0.65 or higher.
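
Ranking features by their correlation with the ticket price could be done along these lines. The data is synthetic, with `vertical_drop` deliberately constructed to drive the price, so the resulting coefficients are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100

# Synthetic data: price depends on vertical_drop, while Runs is unrelated.
vertical_drop = rng.uniform(200, 3000, n)
ski_data = pd.DataFrame({
    "vertical_drop": vertical_drop,
    "Runs": rng.integers(5, 150, n).astype(float),
    "AdultWeekend": 30 + 0.02 * vertical_drop + rng.normal(0, 5, n),
})

# Pearson correlation of every numerical feature with the target.
correlations = (
    ski_data.corr()["AdultWeekend"]
    .drop("AdultWeekend")
    .sort_values(ascending=False)
)

# Features at or above the 0.65 threshold mentioned in the text.
strong_features = correlations[correlations >= 0.65].index.tolist()
```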


### Model Preprocessing and Feature Engineering

With some insight into which features we wanted to include in our model, it was time to take steps towards building one. Our initial model, sometimes referred to as a dummy model, used only the mean of the target as its prediction.

To measure the performance of this initial model we used the **coefficient of determination**, also referred to as the **R-squared (R^2)** value. Simply put, this metric measures the proportion of the variance in the dependent variable that is explained by the independent variable or variables in a regression model. This gave us a base model and baseline results to improve upon. We also used the **mean absolute error (MAE)** and the **root mean squared error (RMSE)** as metrics, since these are expressed in the same units as the ticket price and are therefore easier to interpret.
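
As a minimal sketch of such a baseline, `sklearn`'s `DummyRegressor` with `strategy="mean"` always predicts the training mean. The prices below are invented and the feature matrices are placeholders, since a mean-only model ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented ticket prices; the features are placeholders the model ignores.
y_train = np.array([50.0, 60.0, 70.0, 80.0])
y_test = np.array([55.0, 65.0, 90.0])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# The dummy model predicts the training mean for every resort.
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_pred = dummy.predict(X_test)

# R^2 = 1 - SS_res / SS_tot; MAE and RMSE are in the units of the target.
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
```

On the training data a mean-only model scores an R^2 of exactly 0, and on unseen data it can be negative, which is why it serves as a floor for later models.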

Before any preprocessing was done, we removed the `Big Mountain Resort` (BMR) record from the dataset, since we wanted to predict a price from BMR's feature values. We also split the data into a train dataset and a test dataset before any preprocessing to prevent leakage, and used only the train dataset to fit the model. The test dataset was used to score the model and check for overfitting. Because the dataset was small, we used cross-validation to improve our confidence in the assessment of the model's performance.
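
Holding out the BMR record and splitting the rest might be sketched as below. The `ticket_price` column name, the 70/30 split and the `random_state` value are assumptions for illustration; the write-up does not state the split that was used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 20

# Invented data; ticket_price stands in for the chosen target column.
ski_data = pd.DataFrame({
    "Name": ["Big Mountain Resort"] + [f"Resort {i}" for i in range(1, n)],
    "vertical_drop": rng.uniform(200, 3000, n),
    "ticket_price": rng.uniform(30, 120, n),
})

# Set the BMR record aside so it never influences training or scoring.
is_bmr = ski_data["Name"] == "Big Mountain Resort"
big_mountain = ski_data[is_bmr]
others = ski_data[~is_bmr]

# Split before any preprocessing so test data cannot leak into fitting.
X = others.drop(columns=["Name", "ticket_price"])
y = others["ticket_price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=47
)
```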

For missing values we used `sklearn`'s `SimpleImputer`. `SimpleImputer` can fill missing/null values with the mean or the median of each feature, which gave us the option to try both and then choose the strategy that gave the higher score.
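
The difference between the two strategies can be seen on a small invented column with one extreme value. Note that the imputer is fitted on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented single-feature data with missing entries and one large value.
X_train = np.array([[1.0], [2.0], [np.nan], [100.0]])
X_test = np.array([[np.nan], [5.0]])

# Fit each strategy on the training data, then fill the test data.
filled = {}
for strategy in ("mean", "median"):
    imputer = SimpleImputer(strategy=strategy).fit(X_train)
    filled[strategy] = imputer.transform(X_test)
```

Here the median fill is far less affected by the extreme value than the mean fill, which is one reason to try both.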

To organize the model preprocessing steps we used `scikit-learn`'s `Pipeline` class to build a `pipeline` object. A pipeline prevented us from leaking data from the test set into the model during training, and also allowed us to insert other preprocessing steps or models into the pipeline itself and wrap the entire process in one object. This greatly streamlined the model build-test process, as a pipeline exposes the same `.fit()` and `.predict()` methods as any other `sklearn` model.


### Algorithms

We created a pipeline which:
- imputes missing values using `SimpleImputer` with the median value
- scales the data using `StandardScaler`
- selects the best 15 features using `SelectKBest`
- fits a `LinearRegression` model to the data

We then assessed the pipeline's performance using cross-validation.
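
The steps above can be sketched as a single `sklearn` pipeline. The data is synthetic, and both the `f_regression` scoring function and the 5-fold setting are assumptions, since the write-up does not say which were used.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with about 5% of entries set to missing.
X, y = make_regression(
    n_samples=120, n_features=30, n_informative=10, noise=10.0, random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Impute, scale, select features, then fit the linear model.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=15),  # scoring function assumed
    LinearRegression(),
)

# Each fold refits every step on that fold's training portion only.
cv_results = cross_validate(pipe, X, y, cv=5)
cv_scores = cv_results["test_score"]
mean_score, std_score = cv_scores.mean(), cv_scores.std()
```

Because the imputer, scaler and selector sit inside the pipeline, they are refitted within each fold, so the held-out portion never influences preprocessing.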


Wrapping the preprocessing steps and the linear model in a single pipeline object prevented leakage during preprocessing. It also allowed us to quickly test and assess how the model's score varies as we alter the hyperparameters, which streamlines the hyperparameter tuning process.
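
One common way to vary pipeline hyperparameters and compare scores is `sklearn`'s `GridSearchCV`, sketched below. The write-up does not name the tuning tool, so this choice, the parameter grid, the `f_regression` scoring function and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with about 5% of entries set to missing.
X, y = make_regression(
    n_samples=120, n_features=30, n_informative=10, noise=10.0, random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    SelectKBest(score_func=f_regression),
    LinearRegression(),
)

# Step names come from make_pipeline; "__" addresses a step's parameter.
param_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "selectkbest__k": [5, 10, 15, 20],
}

# Every combination is scored with 5-fold cross-validation.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5).fit(X, y)
best_k = grid.best_params_["selectkbest__k"]
best_strategy = grid.best_params_["simpleimputer__strategy"]
```

The full table of scores per combination is available in `grid.cv_results_`, which makes it straightforward to see how the score changes with `k`.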