## Q1

After dropping all the non-residential homes and the ID column, I spent a considerable amount of time deciding what counted as a low-cost renovation, and hence which renovatable features should be removed - I think I have become a bit of an expert in how to boost the value of your home without it costing a kidney.

A lot of columns were straightforward but there were a few that were debatable. 
- Basement Unfinished Square Feet: I decided to remove this, as finishing an unfinished area has more upside than downside and should be relatively easy to do.
- Masonry veneer type and area: I decided to remove these, as a bit of googling showed that veneer is actually really easy to change and can really boost the value of a house.

**Next, I wanted to remove all the columns where a single value made up 90% or more of the entries** - the logic being that such a column is unlikely to play a big role in predicting the price, as most houses share the same value.
Some examples:
- Kitchens Above Grade: most houses had a kitchen above ground level, save for a minority.
- Miscellaneous Features and their respective values: as you can imagine, very few houses actually had these, so again they are unlikely to be useful for predicting generic houses. I thought it would be interesting to explore them later on (if I have time) to see what part they could play in predicting values for those outlier houses.
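
A minimal sketch of the 90% filter described above, run on a made-up toy frame. The helper names `dominant_share` and `drop_near_constant` are hypothetical, the column names are assumed to follow the usual Ames naming, and treating NaN as a value in its own right is also an assumption.

```python
import pandas as pd

def dominant_share(col: pd.Series) -> float:
    """Share of rows taken up by the most common value (NaN counts as a value)."""
    return col.value_counts(normalize=True, dropna=False).iloc[0]

def drop_near_constant(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Keep only the columns whose most common value covers less than `threshold` of rows."""
    shares = df.apply(dominant_share)
    return df.loc[:, shares < threshold]

# Toy data: 19 of 20 houses share the same kitchen count, lot areas all differ.
toy = pd.DataFrame({
    "KitchenAbvGr": [1] * 19 + [2],
    "LotArea": list(range(20)),
})
kept = drop_near_constant(toy)
print(list(kept.columns))
```

Here `KitchenAbvGr` is dropped because one value covers 95% of the rows, while `LotArea` survives.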

**Next came the fun task of deciding what to do with Null values.**
- Lot Frontage: There were 259 entries with no linear feet of street connected to the property. I assumed this was a mistake (how can you have a property with no street access?) and filled these with the median lot frontage (preferred over the mean due to skew) of houses with the same MSSubClass (a.k.a. what kind of property it is).
- For some of the variables such as Basement Quality (height), Basement Exposure and GarageType, the `pd.read_csv()` read in NA as NaN so I just changed these values back to NoB and NoG. 
- Garage Year Built was an interesting one. I tried a few different things, but the best for the score was to set it to the year the house was built. I didn't like this solution much as it doesn't capture the fact that there is no garage, but given it affected only 79 values and GarageType already captured the missing garage, I decided to stick with it. The other methods I tried were factorizing the column and setting the null values to 0.
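
The three fills above could look roughly like this. It is a sketch on a tiny invented frame, and the column names (`LotFrontage`, `YearBuilt`, `GarageYrBlt`) are assumed to follow the standard Ames naming rather than taken from the actual notebook.

```python
import numpy as np
import pandas as pd

# Invented rows, just enough to exercise each fill.
toy = pd.DataFrame({
    "MSSubClass": [20, 20, 20, 60, 60, 60],
    "LotFrontage": [60.0, np.nan, 80.0, 100.0, 90.0, np.nan],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd", np.nan, "Attchd"],
    "YearBuilt": [1960, 1950, 1975, 2001, 1999, 2005],
    "GarageYrBlt": [1960.0, np.nan, 1980.0, 2001.0, np.nan, 2005.0],
})

# Median lot frontage within each property type, aligned row by row.
group_median = toy.groupby("MSSubClass")["LotFrontage"].transform("median")
toy["LotFrontage"] = toy["LotFrontage"].fillna(group_median)

# "NA" in the raw file meant "no garage", so restore an explicit label.
toy["GarageType"] = toy["GarageType"].fillna("NoG")

# Missing garage year falls back to the year the house was built.
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(toy["YearBuilt"])
print(toy)
```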

**Feature engineering!** Now this was fun and also very arbitrary. I tried a lot of things for this section. I will talk about the things that worked.
- I tried plotting some of the categorical variables to see if I could identify some kind of scale to give them. Factorizing and then squaring Basement Exposure worked, as getting 'Gd' rather than 'Av' was worth more than an increment of 1.
- After noticing that the square feet for the basement, ground floor living area, 1st floor and 2nd floor had a massive impact, I added them together to form a 'total_sqft' column. Then I dropped those columns, as well as Total Rooms Above Grade and Basement Finished Square Feet, as these were very correlated. This helped to boost the final score.
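
Both ideas above can be sketched on a toy frame. The column names are assumed Ames names, and the explicit `exposure_scale` mapping is an assumption too: `pd.factorize` numbers categories in order of appearance, so a hand-written mapping is used here to guarantee that the codes follow the quality order before squaring.

```python
import pandas as pd

toy = pd.DataFrame({
    "BsmtExposure": ["Gd", "Av", "NoB"],
    "TotalBsmtSF": [800, 600, 0],
    "GrLivArea": [1500, 1200, 900],
    "1stFlrSF": [800, 700, 900],
    "2ndFlrSF": [700, 500, 0],
})

# Hypothetical ordinal scale; squaring widens the gap between the top grades.
exposure_scale = {"NoB": 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4}
toy["BsmtExposure_sq"] = toy["BsmtExposure"].map(exposure_scale) ** 2

# Sum the area columns named in the text, then drop the originals.
area_cols = ["TotalBsmtSF", "GrLivArea", "1stFlrSF", "2ndFlrSF"]
toy["total_sqft"] = toy[area_cols].sum(axis=1)
toy = toy.drop(columns=area_cols + ["BsmtExposure"])
print(toy)
```

With this scale the step from 'Av' to 'Gd' is worth 7 (9 to 16) rather than 1.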

**Next I checked all the independent variables and the dependent variable for skew.** Only Lot Area was very skewed, with a value of ~12.2. The dependent variable, SalePrice, also didn't have a normal distribution. So I took the natural log of both.
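
As a rough illustration of the skew check and log transform, here is a sketch on synthetic log-normal data. The numbers are invented and are not the real Lot Area or SalePrice values.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed columns standing in for the real ones.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=1.0, size=2000),
    "SalePrice": rng.lognormal(mean=12.0, sigma=0.4, size=2000),
})

skew_before = toy.skew()       # sample skewness per column
logged = np.log(toy)           # natural log of every value
skew_after = logged.skew()

print(skew_before.round(2))
print(skew_after.round(2))
```

The log pulls in the long right tail, so the skew of both columns moves much closer to 0.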

Finally the data was ready to be fed into the models. After dummifying all the categorical variables (including MSSubClass), splitting the data into train and test sets according to the year sold and standardising the data, **I ran multiple linear regression models - No regularization, Lasso, Ridge and ElasticNet - and....Ridge is a Smidge better! - get it?**

I did also test ElasticNet but it returned an l1-ratio of 1 so it would be the same as testing Lasso essentially. 

I optimised my Ridge model by iterating through many coefficient cut-off thresholds and picking the one with the best mean cv score. I did this by dropping all the columns whose coefficients were below each threshold in a for loop. **My optimal Ridge model dropped 62 columns (out of 117) using a threshold of 0.006221 with a mean score of ~0.8578.**
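
The threshold loop described above might be sketched as follows. This uses synthetic data from `make_regression`, a fixed `alpha=1.0` and 25 evenly spaced thresholds, all of which are assumptions for illustration rather than the settings actually used.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardised housing features.
X, y = make_regression(n_samples=300, n_features=30, n_informative=8,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)
y = (y - y.mean()) / y.std()

# Fit once on everything to get the coefficient magnitudes.
abs_coef = np.abs(Ridge(alpha=1.0).fit(X, y).coef_)

results = []
# endpoint=False keeps every threshold below the largest coefficient,
# so at least one column always survives.
for threshold in np.linspace(0.0, abs_coef.max(), 25, endpoint=False):
    keep = abs_coef >= threshold
    score = cross_val_score(Ridge(alpha=1.0), X[:, keep], y, cv=5).mean()
    results.append((threshold, int(keep.sum()), score))

best_threshold, n_kept, best_score = max(results, key=lambda r: r[2])
print(best_threshold, n_kept, best_score)
```

One caveat with this approach: the coefficients are taken from a fit on the same data that is then cross-validated, so the cv score of the pruned model is likely to be a little optimistic.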

Given my model is a log-linear regression model, meaning $\ln(Y) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$, the way to interpret my coefficients is that for each standard deviation rise in $X_i$, the value of $Y$ is multiplied by $e^{\beta_i}$. E.g. total_sqft had a coefficient of ~0.09 and $e^{0.09} \approx 1.094$, meaning a 1 std rise in total_sqft causes the sale price to multiply by ~1.094, i.e. increase by ~9.41%. **I have demonstrated this for all the coefficients by visualising them as their percentage impact on the price for both the optimal Lasso and Ridge models.**
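
The conversion from a log-linear coefficient to a percentage impact is short enough to write out. The function names here are hypothetical, and 0.09 is the rounded coefficient quoted above.

```python
import math

def multiplier(beta: float) -> float:
    """Factor by which Y is multiplied for a 1 std rise in the feature."""
    return math.exp(beta)

def pct_change(beta: float) -> float:
    """The same effect expressed as a percentage change in Y."""
    return (math.exp(beta) - 1.0) * 100.0

example = pct_change(0.09)
print(round(multiplier(0.09), 3), round(example, 2))
```

Note that the effect is not symmetric: a coefficient of -0.09 gives a percentage drop that is slightly smaller in magnitude than the rise from +0.09.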



## Q2 

I extracted the residuals from my optimal Ridge model by predicting $\ln$(SalePrice) and taking the difference with the log of the actual values. **My residuals were more or less normally distributed - a good sign given this is one of the assumptions of linear regression and works well for the next stage.**
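
A sketch of the two-stage idea, on synthetic data: fit a first model on the fixed features, take the residuals in log space, then use those residuals as the target for a second model on the renovatable features. The feature matrices, true coefficients and `alpha=1.0` are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 400
X_fixed = rng.normal(size=(n, 3))   # stand-in for non-renovatable features
X_reno = rng.normal(size=(n, 2))    # stand-in for renovatable features

# Log price driven mostly by fixed features, a little by the first reno feature.
log_price = (12.0
             + X_fixed @ np.array([0.3, 0.2, 0.1])
             + X_reno @ np.array([0.05, 0.0])
             + rng.normal(scale=0.05, size=n))

# Stage 1: model on fixed features, residuals = actual log price - prediction.
stage1 = Ridge(alpha=1.0).fit(X_fixed, log_price)
residuals = log_price - stage1.predict(X_fixed)

# Stage 2: try to explain what is left over with the renovatable features.
stage2 = Ridge(alpha=1.0).fit(X_reno, residuals)
print(stage2.coef_)
```

In this toy setup the second model recovers a coefficient near 0.05 for the first renovatable feature and near 0 for the second.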

My independent variables were all of the variables I hadn't included above. The data cleaning and EDA were very similar to the steps taken above - removed columns with 90% of the same values, fixed a few errors (basement) and filled in the Null values with relevant tags (a lot of columns with Null values were mistakes from reading NA as NaN so this part was easy). 

Checking the correlation and general scatterplots of categorical variables didn't really offer much insight so I included all of the variables to be dummified and in the models. I did play around with some polynomial tendencies of the variables and found that raising the OverallCond to the power 1.5 worked well. 

After the usual steps, **I again ran multiple linear regression models - No regularization, Lasso, Ridge and ElasticNet - and....Ridge is a Smidge better again - so glad I get to say this twice!** 

I used the same optimisation technique as before and found that **my optimal Ridge model dropped 83 columns (out of 97) using a threshold of 0.008929 with a mean cv score of ~0.1990.**

Now this isn't great by any measure. But it does make sense, as it is trying to estimate the error accounted for by the renovatable features in the residuals of an imperfect model. Interpreting the coefficients (I have once again plotted them as their percentage impact on the price), we can see that the Overall Condition rating has the biggest impact: a 1 std rise in the rating accounts for a ~20% increase in the part of the price attributable to renovatable features.

Using this model we can see that by increasing the rating of a house (whether on the Condition or Quality metrics) through external finishes (you can see what these are in more detail in the pre-optimised visualisation of Ridge coefficients, e.g. Exterior type: BrickFace seems to be positively weighted for a high valuation), we can boost the value of the house. **However, this model is not at all reliable given the low cv score and the fact that it's estimating off the residuals. So it is hard to be sure whether to use this model to decide what to renovate.**

Maybe it would have given more accurate results to include these renovatable features in the original model and assess their impact on the overall price, in order to decide whether they are worth renovating.

## Q3

I have enough difficulty understanding a 2 class problem, so the first thing I did was convert this 6 class problem into a 2 class problem.

**The baseline accuracy for this problem is 93.4%, so there is clearly a huge class imbalance and it will need lots of fun techniques for the model to have a good recall score.** Recall is the score I decided to optimise, seeing as we don't want false negatives (a.k.a. missing houses that will have an abnormal sale), as these will cost the bank a lot more money and hassle.
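
To make the baseline concrete, here is a sketch with invented counts chosen to reproduce the 93.4% figure (they are not the real class counts). It assumes the abnormal class is labelled `Abnorml` and that every other sale category is folded into "normal".

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Invented counts: 66 abnormal sales out of 1000.
sale_condition = pd.Series(["Normal"] * 900 + ["Partial"] * 34 + ["Abnorml"] * 66)

# Collapse to two classes: 1 = abnormal sale, 0 = everything else.
y = (sale_condition == "Abnorml").astype(int)

# The baseline "model" always predicts the majority class.
always_normal = pd.Series(0, index=y.index)

baseline_accuracy = accuracy_score(y, always_normal)
baseline_recall = recall_score(y, always_normal)
print(baseline_accuracy, baseline_recall)
```

The baseline scores well on accuracy but has a recall of 0 for abnormal sales, which is why accuracy is a misleading target here.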

Since the data was already cleaned in the previous two steps, I decided to jump straight into the EDA and feature selection. The correlation matrix showed little correlation between the features and the Sale Category (Abnormal & Normal). So to start with I decided to throw all of the features into the feature matrix.

I ran a GridSearchCV model to test both Ridge (l2) and Lasso (l1) penalties with various C values. Lasso (l1) was the best with an accuracy score of ~93.6%. This at first made me very happy, and then I realised 2 things: 1) the baseline is ~93.4%; 2) this was optimising accuracy. Looking at the confusion matrix, I saw that the model always predicted the majority class, Normal.

I decided to rerun my logreg model using precision and recall scoring instead of accuracy, to see if the model would predict any abnormal cases. I only ran Lasso, as it came out better with GridSearch and this minimised the time spent on running both models. It still came up with nothing. I also tested a KNN algorithm but this still returned nothing.
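
A grid search scored on recall rather than accuracy could be set up roughly like this. The data comes from `make_classification` with a similar imbalance, and the C grid, solver and fold settings are assumptions for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced problem: roughly 7% positives.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           weights=[0.93, 0.07], random_state=0)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),  # Lasso-style penalty
    param_grid,
    scoring="recall",  # recall of the positive (minority) class
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Stratified folds keep the minority share roughly constant in each split, which matters when there are so few positive cases.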

Next I looked at under/oversampling. The issue with simple random over- and undersampling is that oversampling the minority class risks overfitting and undersampling the majority class risks losing important data. **So I decided to jump right in and try SMOTE: Synthetic Minority Oversampling Technique.**

It works by generating synthetic data based on feature-space similarities between existing minority class datapoints. For each minority class datapoint it uses KNN to find its nearest minority neighbours, selects one of them at random and interpolates between the two to produce a new datapoint in the neighbourhood.
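
The interpolation step can be sketched by hand with NumPy and scikit-learn's `NearestNeighbors`. This is only an illustration of the idea, not the imblearn implementation, and the function name `smote_like` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is normally its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)              # random minority points
    neighbour = idx[base, rng.integers(1, k + 1, size=n_new)]   # one random neighbour each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[neighbour] - X_min[base])

rng = np.random.default_rng(42)
X_min = rng.normal(size=(30, 4))    # made-up minority class samples
synthetic = smote_like(X_min, n_new=50)
print(synthetic.shape)
```

Every synthetic point sits on the line segment between two real minority points, so it never leaves the region the minority class already occupies.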

However, I had some issues with installing the imblearn package, which meant I had to spend a while uninstalling and reinstalling clashing files (? I still don't know how exactly to resolve this...but I am trying a few things). So I am submitting what I have so far. The techniques I read about are quite interesting, so I will continue working on the project (or maybe use them on the next one...I got a bit tired of the problem towards the end). I would have liked to also experiment with linear Support Vector Machines, decision trees and Adaptive Synthetic Sampling (ADASYN). I would also like to do some more research on how to select the best features, such as the ANOVA feature selection technique you mentioned, as I find this process quite arbitrary and would like to see what others generally do.
