# Data Wrangling Summary 

Write a summary statement that highlights the key processes and findings from this notebook. This should include information such as the original number of rows in the data, whether our own resort was actually present etc. What columns, if any, have been removed? Any rows? Summarise the reasons why. Were any other issues found? What remedial actions did you take? State where you are in the project. Can you confirm what the target feature is for your desire to predict ticket price? How many rows were left in the data? Hint: this is a great opportunity to reread your notebook, check all cells have been executed in order and from a "blank slate" (restarting the kernel will do this), and that your workflow makes sense and follows a logical pattern. As you do this you can pull out salient information for inclusion in this summary. Thus, this section will provide an important overview of "what" and "why" without having to dive into the "how" or any unproductive or inconclusive steps along the way.

## Original data information

- The original dataset contained 330 rows and 27 columns, with a total of 636 missing values. The columns were:
      ['Name', 'Region', 'state', 'summit_elev', 'vertical_drop', 'base_elev',
       'trams', 'fastEight', 'fastSixes', 'fastQuads', 'quad', 'triple',
       'double', 'surface', 'total_chairs', 'Runs', 'TerrainParks',
       'LongestRun_mi', 'SkiableTerrain_ac', 'Snow Making_ac',
       'daysOpenLastYear', 'yearsOpen', 'averageSnowfall', 'AdultWeekday',
       'AdultWeekend', 'projectedDaysOpen', 'NightSkiing_ac']
      
- Our resort, Big Mountain, was present and had no missing values.

- The data was organized by `state` and `Region`. For 297 resorts the two labels were identical; for the remaining 33 they differed.

- Plotting a histogram of each variable revealed potential outliers, as well as variables with very uneven distributions that would not be useful in our model.


In [None]:
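import matplotlib.pyplot as plt
# One histogram per numeric column; extra hspace keeps the subplot titles readable.
original_data.hist(figsize=(15, 10))
plt.subplots_adjust(hspace=0.5);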

The following variables stood out in the histograms:

* 'SkiableTerrain_ac'
* 'Snow Making_ac'
* 'fastEight'
* 'fastSixes'
* 'trams'
* 'yearsOpen'
* 'fastQuads'

After careful examination, we were able to take care of the outliers in `yearsOpen` and `SkiableTerrain_ac`. We dropped `fastEight` as half its values were missing. We kept `fastSixes`, `trams` and `fastQuads` for now.
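The overview figures quoted in this summary (row and column counts, total missing values, how often `state` and `Region` agree, and the share of missing values per column) can be reproduced with a few pandas calls. The sketch below uses a tiny made-up frame, `toy_data`, standing in for `original_data`; its values are illustrative only and are not taken from the resort data.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (assumption); the notebook works on original_data instead.
toy_data = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Region": ["Montana", "Sierra Nevada", "Utah", "Montana"],
    "state": ["Montana", "California", "Utah", "Montana"],
    "fastEight": [np.nan, 0.0, np.nan, 0.0],
    "AdultWeekend": [81.0, np.nan, 95.0, 60.0],
})

# Basic size and missing-value counts.
n_rows, n_cols = toy_data.shape
total_missing = int(toy_data.isnull().sum().sum())

# Number of rows where the Region label simply repeats the state label.
same_label = int((toy_data["Region"] == toy_data["state"]).sum())

# Share of missing values per column, highest first.
missing_share = toy_data.isnull().mean().sort_values(ascending=False)

print(n_rows, n_cols, total_missing, same_label)
print(missing_share)
```

A column whose missing share is around 0.5, as reported for `fastEight`, is a natural candidate for dropping.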

- All observations that were missing both target variables were dropped. 

In [None]:
# Same histograms after cleaning, for comparison with the earlier plot.
original_data2.hist(figsize=(15, 10))
plt.subplots_adjust(hspace=0.5);

Most of the variables' distributions are looking much better.

A dataset of US state population and area (obtained from wikipedia.com) was merged with the per-state totals of `SkiableTerrain_ac`, `daysOpenLastYear`, `TerrainParks` and `NightSkiing_ac`, together with `resort_per_state`, to evaluate any possible relationship with state population and area.
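A minimal sketch of that aggregation and merge, assuming a `groupby` on `state` followed by a left merge. The frames `resorts` and `state_info`, the `state_total_*` column names (other than `state_total_nightskiing_ac`, which appears in the EDA summary), and the population and area columns are hypothetical; the population and area figures are made-up round numbers, not real statistics.

```python
import pandas as pd

# Illustrative resort rows (assumption); values are not from the real data.
resorts = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "state": ["Montana", "Montana", "Utah"],
    "SkiableTerrain_ac": [3000.0, 500.0, 2000.0],
    "daysOpenLastYear": [120.0, 100.0, 140.0],
    "TerrainParks": [2.0, 1.0, 3.0],
    "NightSkiing_ac": [600.0, 0.0, 100.0],
})

# Hypothetical state table with placeholder numbers.
state_info = pd.DataFrame({
    "state": ["Montana", "Utah"],
    "state_population": [1_000_000, 3_000_000],
    "state_area_sq_miles": [150_000, 85_000],
})

# One row per state: resort count plus the summed quantities.
state_summary = (
    resorts.groupby("state")
    .agg(
        resort_per_state=("Name", "size"),
        state_total_skiable_area_ac=("SkiableTerrain_ac", "sum"),
        state_total_days_open=("daysOpenLastYear", "sum"),
        state_total_terrain_parks=("TerrainParks", "sum"),
        state_total_nightskiing_ac=("NightSkiing_ac", "sum"),
    )
    .reset_index()
)

# A left merge keeps every state that has resorts, even if the lookup misses it.
merged = state_summary.merge(state_info, on="state", how="left")
print(merged)
```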

- Upon closer examination of our target state (Montana), we observed that `AdultWeekend` and `AdultWeekday` have the same value at every Montana resort. We therefore dropped `AdultWeekday`, as it had more missing values than `AdultWeekend`, and dropped all observations with a missing `AdultWeekend` value.

- We ended up with a final dataset of 277 rows and 25 variables, as well as 299 missing values.

- At this point, we cannot confirm what the predictor variables are, but we can confirm that `AdultWeekend` is our desired target variable.
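The two row-dropping steps described in this summary (removing resorts with no price at all, then keeping only `AdultWeekend` as the target) could look roughly like this. The frame `toy` and its values are invented for illustration, and the intermediate names `has_price` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative prices (assumption): B has no price at all, D lacks a weekend price.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "AdultWeekday": [50.0, np.nan, np.nan, 70.0],
    "AdultWeekend": [55.0, np.nan, 80.0, np.nan],
})

# Step 1: drop rows only when BOTH price columns are missing.
has_price = toy.dropna(subset=["AdultWeekday", "AdultWeekend"], how="all")

# Step 2: drop the weekday column, then require the remaining target to be present.
cleaned = has_price.drop(columns="AdultWeekday").dropna(subset=["AdultWeekend"])

print(cleaned)
```

Using `how="all"` in the first step matters: the default `how="any"` would also discard resorts that still have one usable price.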

# Exploratory Data Analysis Summary



Write a summary of the exploratory data analysis above. What numerical or categorical features were in the data? Was there any pattern suggested of a relationship between state and ticket price? What did this lead us to decide regarding which features to use in subsequent modeling? What aspects of the data (e.g. relationships between features) should you remain wary of when you come to perform feature selection for modeling? Two key points that must be addressed are the choice of target feature for your modelling and how, if at all, you're going to handle the states labels in the data.



- We explored the number of resorts relative to state population and to state area. This revealed some strong correlations, and the states expected to lead in each variable were not among the states with the highest resort density.

- To do away with these strong correlations, we applied PCA. To do so, we first scaled the data, then fitted the PCA transformation, and finally applied the transformation to create derived features.

- We created a data frame containing the first two principal components and the average ticket price, and added a categorical variable that assigns each observation to a quartile range.

- A scatter plot of this data frame revealed no patterns concerning the ticket price. This makes the decision about using the state labels a little unclear. On one hand, the state label seems relevant when it comes to the population (potential customers) and the ratio of resorts to the state area (competition). On the other hand, state labels showed no patterns when evaluated against ticket prices. I believe we should evaluate the variables that show strong correlations with the target variable both with and without the state labels, and see what differences emerge.

- We converted the states' quantifiable variables into ratios for each observation (e.g., `NightSkiing_ac`/`state_total_nightskiing_ac`).

- A correlation heatmap revealed some strong correlations between `AdultWeekend` and `fastQuads`, `Runs`, `Snow Making_ac`, `resort_night_skiing_state_ratio`, `vertical_drop` and `total_chairs`.
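The PCA steps listed above (scale, fit, transform, then combine the first two components with price and a quartile label) can be sketched as follows. This is a sketch under stated assumptions, not the notebook's exact code: the data is random, the feature names `feat_a` to `feat_d` and the frame names are hypothetical, and binning the average price with `pd.qcut` is one plausible reading of the quartile step.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Random stand-in for a per-state summary table (assumption).
rng = np.random.default_rng(0)
state_index = [f"state_{i}" for i in range(12)]
state_features = pd.DataFrame(
    rng.normal(size=(12, 4)),
    columns=["feat_a", "feat_b", "feat_c", "feat_d"],
    index=state_index,
)
avg_price = pd.Series(rng.uniform(30, 100, size=12), index=state_index)

# 1. Scale each column to zero mean and unit variance.
scaled = scale(state_features)

# 2. Fit the PCA transformation, 3. apply it to get the derived features.
pca = PCA().fit(scaled)
components = pca.transform(scaled)

# Combine the first two components with the average price.
pca_df = pd.DataFrame(
    {"PC1": components[:, 0], "PC2": components[:, 1], "AdultWeekend": avg_price},
    index=state_index,
)

# Assign each row to a price quartile (four equal-sized bins).
pca_df["Quartile"] = pd.qcut(pca_df["AdultWeekend"], q=4, precision=1)

print(pca.explained_variance_ratio_.cumsum())
print(pca_df.head())
```

The cumulative explained variance printed at the end is a quick check on how much information the first two components retain before relying on a two-dimensional scatter plot.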



# Preprocessing & Training Summary

Write a summary of the work in this notebook. Capture the fact that you gained a baseline idea of performance by simply taking the average price and how well that did. Then highlight that you built a linear model and the features that found. Comment on the estimate of its performance from cross-validation and whether its performance on the test split was consistent with this estimate. Also highlight that a random forest regressor was tried, what preprocessing steps were found to be best, and again what its estimated performance via cross-validation was and whether its performance on the test set was consistent with that. State which model you have decided to use going forwards and why. This summary should provide a quick overview for someone wanting to know quickly why the given model was chosen for the next part of the business problem to help guide important business decisions.

- As our first step, we divided the data into training and test data.
- Second, we used the mean value as a baseline. As expected, its performance was very poor, with R² = -0.00312 when compared against the test values. This translated to the model being off by an average of `$19`.
- Once our baseline was covered, we imputed missing values with both the mean and the median, scaled the predictors, and tried a linear regression model. This showed no significant difference between the two imputation alternatives.
- Next, we repeated the same test using the built-in pipelines sklearn offers. Both methods produced exactly the same values.
- We proceeded to refine the linear model using SelectKBest, to avoid overfitting the model by using all the predictors. We used cross-validation to avoid tuning the model to an arbitrary test set.
- We also applied a grid search to automate the process of selecting the best k for SelectKBest. This returned k = 8 as the best number of predictors: `vertical_drop`, `Snow Making_ac`, `total_chairs`, `fastQuads`, `Runs`, `LongestRun_mi`, `trams`, and `SkiableTerrain_ac`.
- We compared the linear regression model to a random forest model. Applying the same grid search approach gave `fastQuads`, `Runs`, `Snow Making_ac` and `vertical_drop` as the best predictors for that model.
- We cross-validated both models using the best estimators. Random forest was the best model, with an MAE of `$9.5`, almost `$1` more accurate than the linear model.
- Finally, we used `learning_curve` to check that our data set was large enough for our prediction. This indicated that a data set of around 40 to 50 observations was enough to create a predictive model.
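The workflow above (train/test split, mean baseline, then an impute, scale, select and regress pipeline tuned by grid search) can be sketched end to end as follows. This is a minimal illustration on synthetic data from `make_regression`, not the resort data: the sample size, noise level, `random_state` values, median imputation and `f_regression` scoring are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and the AdultWeekend target (assumption).
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)
X[::15, 0] = np.nan  # inject a few missing values so the imputer has work to do

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=47)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
baseline_r2 = r2_score(y_test, baseline_pred)
baseline_mae = mean_absolute_error(y_test, baseline_pred)

# Pipeline: impute -> scale -> keep the k best features -> linear regression.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SelectKBest(score_func=f_regression),
    LinearRegression(),
)

# Grid search over k with 5-fold cross-validation on the training set only.
param_grid = {"selectkbest__k": list(range(1, X.shape[1] + 1))}
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

best_k = search.best_params_["selectkbest__k"]
cv_mae = -search.best_score_
test_mae = mean_absolute_error(y_test, search.predict(X_test))

print(f"baseline R^2={baseline_r2:.4f}, baseline MAE={baseline_mae:.2f}")
print(f"best k={best_k}, CV MAE={cv_mae:.2f}, test MAE={test_mae:.2f}")
```

A mean-only baseline scores an R² of at most zero on held-out data, which matches the slightly negative value reported above. Comparing `cv_mae` with `test_mae` is the consistency check the summary refers to. Swapping `LinearRegression()` for `RandomForestRegressor()` from `sklearn.ensemble`, with a grid suited to that model, follows the same pattern.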